The rapid evolution of data scraping technologies is forcing us to reconsider some evergreen questions: What content is public? What is private? How do we decide?
For the most recent event in CLTC’s series on data scraping and privacy, Paul Ohm, Professor of Law at the Georgetown University Law Center, discussed an emerging and underappreciated approach to digital system design called “desirable inefficiency.”
Designers have turned to desirable inefficiency when the efficient alternative fails to provide or protect some essential human value, such as fairness or trust. Desirable inefficiency is an example of a design pattern that engineers have organically and voluntarily adopted to protect and make space for human values.
Ohm spoke with Tejas Narechania, Robert and Nanci Corson Assistant Professor of Law at the University of California, Berkeley, School of Law. They discussed desirable inefficiency as it relates to the role of friction in privacy and the emergence of more powerful data scraping technology.
This conversation was the fourth in a series that CLTC has convened this year on different facets of data scraping, as the evolution of technology forces us to reconsider questions about what is public and what is private. (Recaps of prior events in this series are available through the following links: Privacy and Data Scraping, Perspectives from Latin America; Data Scraping for Research Purposes; and Data Scraping & the Courts: State of Play with the CFAA.)
Excerpts from the conversation are below. (Content has been edited for length and clarity.)
Narechania: Different courts have thought about what data scraping is in slightly different terms. The District Court in Minnesota has referred to data scraping as quite simply the pulling of information off a website, while others have suggested that scraping involves extracting data from a website and copying it into a structured format, allowing for data manipulation and analysis — and that scraping can be done both manually and by a bot. How has scraping changed the way we think about digital system design, how has automated scraping made things more efficient, and why would we want things to be less efficient?
Ohm: I’m sort of a professional web scraper. I’ve been scraping the web for as long as there has been the web. And I’ve trained more than a generation of law students on how to use some rudimentary computer skills, and part of that has been how to scrape a website in Python.
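As a flavor of the kind of classroom exercise Ohm describes, a scrape can be just a few lines of Python. The sketch below assumes the widely used requests and BeautifulSoup libraries; the URL and the extracted fields are placeholders, not anything from the conversation.

```python
# A minimal scraping sketch, assuming the requests and beautifulsoup4
# packages are installed. The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-public-page"
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# "Extracting data from a website and copying it into a structured
# format": here, every hyperlink on the page, collected into a list.
links = [a["href"] for a in soup.find_all("a", href=True)]
for link in links:
    print(link)
```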
Scraping is very much an arms race between the people who are publishing this information and the people who are scraping the information. It is counterintuitive that we have these arms races that are spinning off inefficient technologies. The very first thing you learn in computer science education is that you’ve got to make your code more efficient. The study of algorithms is really the study of code and efficiency. The bottom line is, as we scale systems to be larger and to include more data, how can we keep things computationally feasible? How can we make sure that we can process things at a very large scale?
I had a graduate student named Jonathan Frankle (who is now a machine learning expert of some note). About six or seven years ago, we put the Bitcoin proof-of-work algorithm on the board. And we said, how does this make sense? This is such a wasteful process. It’s now been replicated millions of times around the world. The fact that Bitcoin at its heart has an inefficient piece is now really well known, because we’ve talked about the environmental impact of cryptocurrency. But seven years ago, this was still kind of a new idea. Together, we wrote an article for the Florida Law Review, and that’s where we decided to frame this as we have, as “desirable inefficiency.” We started to explore why computer scientists would bake something into a system that’s slow.
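To make the “wasteful process” concrete: Bitcoin-style proof of work amounts to hashing until the output clears an arbitrary difficulty threshold. The toy sketch below illustrates the idea; it is a simplification, not Bitcoin’s actual mining code.

```python
# Toy hashcash-style proof of work: deliberately burn CPU time until a
# hash with enough leading zero hex digits turns up. A simplified
# illustration of the idea, not Bitcoin's actual algorithm.
import hashlib

def proof_of_work(data: str, difficulty: int = 5) -> int:
    """Return a nonce such that sha256(data + nonce) begins with
    `difficulty` zero hex characters."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

# Raising `difficulty` by one makes the search roughly 16x slower on
# average: the inefficiency is the point.
print(proof_of_work("block header goes here", difficulty=5))
```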
The most interesting thing for me about that project is that we started to see it everywhere. That article is a catalogue of nine examples of important systems that have a “slowing mechanism,” or an illogically convoluted spatial algorithm, for reasons that you wouldn’t expect if you were in that undergraduate computer science classroom. And it’s led me on this long path of thinking about desirable inefficiency, not only for that paper, but in my current work as well.
Narechania: Tell me more about these nine systems, and about why programmers bake inefficiency into problems they are trying to solve.
Ohm: It connects deeply with privacy, which is the core of my work. Our thesis is that it results when a programmer is asked to inject some sort of human value that typically they’re not coding for. We code for speed all the time, we code to make sure that things work on slower processors all the time. In the olden days, we used to code for hard disk scarcity, because hard disks were expensive. They’re not so expensive anymore. Undergraduate education focuses on efficiency, but when there’s another human value — and the list is long — that’s when you might expect to see desirable inefficiency as a solution.
My favorite example is IEX [Investors Exchange], which is one of the recognized stock exchanges. IEX was made famous by Michael Lewis’s book Flash Boys. The problem IEX set out to solve was flash traders: people who had privileged access to the network because they spent a lot of money. They needed to be in the right rack and the right server room in the right basement in New Jersey. They were able to step in and be in the middle of trades at a rate that is not only inhuman, but faster than all the other computers run by all the other companies. And through that, they could use arbitrage and engage in an activity called “front running.”
IEX deemed that unfair. There was this human value, fairness, and it’s not the kind of thing we code for a lot. But fairness was something they wanted to restore. One of the things I love most about this example is that the high-speed traders said, you’ve got this completely backwards: we’ve paid for high-speed access in this basement; you’re the one who’s acting unfair if you’ve decided on a desirably inefficient system to lock us out.
For purposes of this project, I don’t care who’s right or wrong. I think that’s what happens with human values. We bring values to the table, and then we disagree. But the most interesting thing about IEX is the way they implemented their solution. They literally took a shoebox-sized metal box. And then they wrapped 38 miles of fiber optic cable inside that box. So to trade on IEX, every single packet of information has to travel that 38 miles, and it slows down the trade by 350 microseconds, which they deemed to be exactly the right amount of time to make sure that trading is still efficient, but so that these high-speed traders no longer have that unfair advantage. I’m so intrigued by this phenomenon.
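A back-of-the-envelope check of that speed bump, assuming light in fiber travels at roughly two-thirds of its vacuum speed: 38 miles of coiled cable works out to a delay on the order of 300 microseconds, the same ballpark as the 350-microsecond figure Ohm cites (the exact value depends on the fiber and the rest of the signal path).

```python
# Rough estimate of the delay from a 38-mile coil of fiber, assuming
# signals travel at about 2/3 the vacuum speed of light in glass.
C_VACUUM = 299_792_458            # meters per second
FIBER_SPEED = C_VACUUM * 2 / 3    # approximate speed in optical fiber
MILES_TO_METERS = 1_609.344

coil_length_m = 38 * MILES_TO_METERS
delay_us = coil_length_m / FIBER_SPEED * 1e6
print(f"one-way delay ~ {delay_us:.0f} microseconds")   # roughly 306 microseconds
```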
Narechania: What’s interesting about the systems you’re talking about is that control over the competition among human values rests with programmers.
Ohm: You are preaching to the choir. I run something called the Tech and Society Initiative, and that name is quite intentional. We think you’re not doing your job if you’re thinking about tech without the society — if you’re not thinking not only about human values, and having an opinion about who’s right or wrong, but also about the governance structures and the way power gets distributed in all sorts of aberrational ways.
It’s not that programmers are entitled to control, or that the natural evolution of things gave programmers control because they were the most worthy. As we put this article together, [Jonathan Frankle and I] were careful to say, this is step one in a longer project. Step one is, this really interesting thing with inefficiency is starting to happen spontaneously. It’s starting to happen from the technologists themselves. But where this project ends, and probably gets to pretty quickly, is governance. The word that’s in the title of this talk is friction. Introducing friction into the right place in a system is something that companies do, individual programmers and outside hackers do, and data scrapers do. Ultimately, I think governments and other civil society organizations should at least have a say in where that friction goes. But to understand that, we have to understand how this friction works and where it comes from.
Narechania: I agree that we need to spend some time understanding the technical systems to be able to design governance systems that help us resolve the competition among human values in a way that’s not simply, let the programmer decide. You say that civil society or companies should have a say in that friction. What does that look like?
Ohm: I am not a political theorist. But one thing that’s interesting for legal scholars like you and me — who started with a technical background — is that we can be a little less timid about proposing things that in some circles would be considered impolite. Right now, there is really this monopoly on design by the programmers. At the end of the day, a few founders of a few companies have a kind of disproportionate vote, if we want to put it in democratic terms, in the systems that govern our elections and govern our speech. If there’s a common thread to my work, it’s that we need to be able to recognize that there is another way. Given the kind of democratic society we happen to live in, there are avenues. There are statutes we can write, and there are rules we can write. And they may fundamentally shift what those technologists do or do not do. They may do it through incentives and nudges, or through old-fashioned command-and-control diktats: “you have to put friction here, you cannot put friction there.” Unless we have a very concrete, specific topic, it’s harder for me to go deeper than that. But I want to be known as one who’s willing to broaden the design table and put a bunch of people at it. Because right now, it’s a lonely little dinner party, and we should expand that.
Narechania: One of the critiques of large technology companies is that they control many of these systems, and they are so important to our lives that perhaps more democratic control over the ways they function would be better for society. Rather than asking about the governance systems that you think we should have, what do you make of the ones that we do have — such as the Computer Fraud and Abuse Act (CFAA) or the California Consumer Privacy Act (CCPA)? Have they had the effect you’ve envisioned in changing the way that programmers or companies address system design?
Ohm: I started my career looking for hackers so we could prosecute them under the CFAA. I took a year off to go to the Federal Trade Commission, where I helped them write a bunch of privacy regulations and worked on some cases. And I am now moonlighting part time with the Colorado Attorney General’s office. I’m not speaking on their behalf; none of these views reflect the opinions of the Colorado Attorney General. But they’re writing privacy rules. And Colorado, in a very stealthy way, has a statute very much inspired by the CCPA. And I’m part of the team writing those rules, so call me biased if you want. But I am a real believer in the power of government to thoughtfully engage with the way tech is designed and evolves. I actually am probably more of a defender of our current toolkit than your average law professor in our fields. I’m not going to say the CFAA is a mess, or ECPA [the Electronic Communications Privacy Act] makes no sense, or the CCPA is a drop in the bucket, although I’ve heard friends and colleagues say all three of those things. None of them is perfect. It’s hard to write text that reflects current technical reality, and it’s even harder to keep text relevant amid the seismic changes we’ve seen.
I’m a big believer in delayed governance, where we set up first principles and then trust a regulatory body and expert agency to lend meaning to them, either through rule-making or enforcement decisions. I’m similarly a big believer in the strength of our federal judiciary to tease out super-hard technical questions, as long as the litigators are doing their jobs. Giving them open-textured language can lead to good results. I would love to see creative thinking that goes beyond where any of these three laws have gone and really disrupts the polite, arm’s-length deference to the tech industry.
Narechania: I agree that these legislative efforts, like most legislative efforts, are good but not perfect. And I am also a big believer in having our legislatures set basic ground rules and then letting those evolve and develop over time through skilled and politically accountable regulators, and in the courts. All of that is a nice reflection of the way our system is supposed to work. And by the way, you and I are both saying this after a four-year test of our faith in that.
Ohm: I agree that elections have consequences and that it is important to vote. And that political accountability has to cut both ways. There are a lot of pieces of conventional wisdom that you hear all the time about regulating tech that I just think are fundamentally flawed. Something like “judges can’t keep up with changing technology” — you hear that all the time in patent disputes. But when it comes to something like the CFAA, you also hear, people in Congress are too old to understand technology. That may be empirically true, but I don’t think it follows that they’re unable to take a career of thinking about how to regulate industries and apply it here.
In a small paper I wrote, we drew inspiration from the right to repair movement. This is very much on the friction theme. We argue that every smart home device should have a kill switch that restores its operation as a plain old thermostat, or a plain old fridge, and that there should be an affirmative statutory obligation for every large smart-home manufacturer in the US — or who wants to sell in the US — to build in a switch like that. It will be expensive, and it will be inefficient, and the price of these devices may go up a little. And I’m not even sure it’s going to be used by many consumers. But I believe that encouraging lawmakers to be proactive in the design of technical systems is so important that this is a step in the right direction, just as the right to repair is.
Narechania: We do need our legislators to be thinking more proactively about, what do we want? What sorts of long-term responsibilities should be put on the manufacturers of connected devices, whether they are smart home devices, or biometric implants or anything else? Another question that we received in advance is, does encryption work against data scraping? What emerging methods and patterns can protect public data against data scraping? Maybe you can talk generally about the role of encryption in both scraping and friction generally.
Ohm: Let me say a little bit more about friction. When I say friction, I mean this in a capacious way. Friction takes a lot of forms. You can picture a speed bump. A speed bump doesn’t stop a car from passing through, it just limits the speed, for lots of purposes. There are a lot of technological equivalents. One that’s related to encryption was made famous in the FBI-Apple fight from a few years back: every time you fat-finger your password on your iPhone, it delays logging in, according to some set standard that was tuned by the Apple engineers. This friction has been highly studied and designed. The default at the time was that the delays got really long after about five false guesses. And on some versions of iOS, once you got to nine or 10, it basically locked you out of the phone. Depending on how you have the settings, it might even break your phone or delete all the content from your phone. This is highly designed friction. The human values I talked about are everywhere. This is about government surveillance. This is about identity theft. This is about trust. This is about security. There are lots of values-laden choices that the Apple engineers made in designing this encryption and unlocking scenario.
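The escalating-delay friction Ohm describes can be sketched in a few lines; the schedule and lockout threshold below are illustrative stand-ins, not Apple’s actual parameters.

```python
# Sketch of escalating delay after failed passcode attempts. The delay
# schedule and lockout threshold are illustrative, not Apple's values.
import time

DELAY_SCHEDULE = {5: 60, 6: 300, 7: 900, 8: 3600}   # attempt count -> seconds to wait
LOCKOUT_AFTER = 10                                   # attempts before lockout

def check_passcode(entered: str, correct: str, failed_attempts: int) -> bool:
    if failed_attempts >= LOCKOUT_AFTER:
        raise RuntimeError("Device locked: too many failed attempts.")
    wait = DELAY_SCHEDULE.get(failed_attempts, 0)
    if wait:
        time.sleep(wait)          # the deliberately injected friction
    return entered == correct
```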
Encryption is really interesting, because we often think of encryption as a binary. You either have the passcode and you get in, or you don’t. But encryption, if you’ve sprinkled it in the right places, can actually be more like a speed bump. In the middle of the crypto wars is a fight over, should we have backdoored encryption protocols? [Charles Wright and Mayank Varia] wrote a paper describing what they called “crypto crumple zones.” The idea was, you could have robust, military-grade encryption, but if you had the right kind of key, that encryption would be crackable, but only at great expense. You could brute force it at the expense of, say, a million dollars. The paper described it in terms of dollars and cents, and not in the number of years it would take a Pentium processor to attack it. I think there are a lot of people in the crypto wars who would find it deeply offensive that we would try to dumb down encryption in that way. But I thought it was such an interesting way to imagine a tuning dial, democratically set, that would allow the encryption to work, except when it didn’t because some judge signed some warrant that produced some key. The relationship between friction and encryption is super interesting. And there’s a lot there.
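The dollars-and-cents framing is just arithmetic: pick a cost per key guess and a key size, and the expected price of a brute-force attack follows. The numbers below are invented purely to show how such a dial could be tuned.

```python
# Expected brute-force cost = (keyspace / 2) * cost per guess.
# Both numbers are hypothetical, chosen only for illustration.
COST_PER_GUESS_USD = 1e-12    # assumed cost of testing a single key
KEY_BITS = 61                 # a deliberately weakened, "crackable" key size

expected_guesses = 2 ** (KEY_BITS - 1)           # on average, half the keyspace
expected_cost = expected_guesses * COST_PER_GUESS_USD
print(f"expected cost ~ ${expected_cost:,.0f}")  # about $1.2 million with these inputs
```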
Data scraping is something that I’ve spent a lot of time doing, and I’ve run into what I would deem “friction responses to data scraping.” Before you get to legal remedies, this is just trying to make it easier to detect that the person who is acting is not the kind of person you want on your website, and then doing something to slow them down. One of the classic ones that comes up a lot in the jurisprudence is MAC address filtering. A MAC address is not quite immutable, but it’s an identifier that’s associated with the network interface card [NIC] on your computer. It’s a little bit more like a fingerprint, a unique ID, than an IP address, which could be shared by different devices. There have been cases where the scraper’s MAC address is visible to the party being scraped. They know that you are scraping just by the volume of your activity, and so they block that MAC address. That might feel like it’s not friction, that’s just security. But the thing about MAC addresses, even though they seem immutable, is that they actually can be changed. You don’t have to be that sophisticated to change them. So it’s a kind of roadblock or speed bump, but one that a technically sophisticated data scraper can probably get around.
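To see why MAC filtering is only a speed bump, consider that fabricating a replacement address takes a few lines, and applying it on Linux is a single privileged ip link command; the interface name in the sketch below is a placeholder.

```python
# Why MAC filtering is a speed bump: generating a replacement address is
# trivial. Applying it requires a privileged command (shown here as a
# printed string); "eth0" is a placeholder interface name.
import random

def random_mac() -> str:
    # Set the locally-administered bit and clear the multicast bit in the
    # first octet, the convention for manually assigned MAC addresses.
    first = (random.randint(0, 255) | 0x02) & 0xFE
    rest = [random.randint(0, 255) for _ in range(5)]
    return ":".join(f"{octet:02x}" for octet in [first] + rest)

print(f"sudo ip link set dev eth0 address {random_mac()}")
```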
This is at the heart of one of the most tragic stories in all of data scraping history, which is the case of Aaron Swartz. Swartz was scraping an archive of scholarly literature from a closet at MIT, and what MIT did on this long road to trying to figure out who this was and stop him was to implement MAC address filtering. We can get to what the law says about this, but that’s friction. It’s not a perfect block, but it would have been a decent block if Aaron Swartz hadn’t been as technically sophisticated; for someone less sophisticated, that might have been a real roadblock. It was tragic because he committed suicide before the prosecution finished its case. So what do we make of that in terms of the IT security side of things? And what do we make of that in terms of the legal remedy side of things, the morality side of things? It’s really a place where all of these issues come together.
Narechania: How do you balance the benefits of data scraping, including what you can learn from the data that you can scrape, with obligations of data privacy that end users have?
Ohm: Clearview AI is such a fascinating Rorschach test, because it puts that question front and center. Clearview, of course, is the company that scraped social media for images. It complicates the privacy story, because these were images that were, sort of, out in public on social media. They used them as a massive training data set to create a really effective (at least according to their marketing) facial recognition system. And then they compounded the harm by flip-flopping their business model a few times. At one point it was that they would sell this to anyone, and then later it was that they would only sell it to the governments of the world — as if that made it better.
One way to respond to the Clearview story is to make it a data scraping story — that either the original sin, or the best place where we could have stopped it, was the moment we realized that this bot army was downloading not thousands, but hundreds of millions of images. And it was pre-labeled, so it was really good for the machine learning piece of this.
I won’t pretend I’ve surveyed all my friends on Clearview, but I find that every time it comes up, it’s really hard to predict how the person in front of you will respond to it. Most often I’ve heard people say, it’s not a data scraping story at all. It’s a machine learning and ethical business model story. I think it’s a scraping story too. I think that, had we realized it was the same company that was downloading all of these images without permission, my sense of an imminent possible privacy violation would begin to at least outweigh whatever research benefit this company or entity was hoping to achieve, especially because of the scale.
So here’s where friction could come in. Let’s just imagine a sci-fi technology where, if you scrape a few photos, there’s no friction imposed whatsoever. If you scrape 100, maybe we require the processor on your computer to work a little bit harder for each one after 100. And then once you get to 10,000, you’re forced to do some sort of Bitcoin-style proof-of-work math puzzle to get the next photo. There’d be arms races all the way down. But if you could implement it, friction might separate — to use crude language — the “good” researchers and the “bad” researchers — good, meaning consistent with privacy considerations and the value of this research, versus not. It’s going to be crude. There will be false positives and false negatives. But what I love is that it makes what seems impossible now possible, which is why I think it might be a scraping story, and not just an ethical business model story.
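A rough sketch of the tiered friction Ohm imagines, with the thresholds and difficulty curve invented for illustration: small-scale scraping costs nothing, and past each threshold every additional item requires solving a harder proof-of-work puzzle.

```python
# Sketch of tiered, escalating friction for scrapers. The thresholds and
# difficulties are invented; the puzzle is the same toy proof of work
# sketched earlier.
import hashlib

def solve_puzzle(challenge: str, difficulty: int) -> int:
    """Burn CPU until sha256(challenge + nonce) starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    nonce = 0
    while not hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest().startswith(target):
        nonce += 1
    return nonce

def required_difficulty(items_fetched: int) -> int:
    if items_fetched < 100:
        return 0        # hobbyist scale: no friction at all
    if items_fetched < 10_000:
        return 3        # modest puzzle per item
    return 5            # industrial scale: real computational cost per item

def fetch_item(item_id: int, items_fetched: int) -> None:
    difficulty = required_difficulty(items_fetched)
    if difficulty:
        solve_puzzle(f"item-{item_id}", difficulty)   # pay the toll first
    # ...then actually download the item...
```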
Watch the rest of the conversation above or on YouTube.