The rapid evolution of data scraping technologies (computer programs that can extract data from a website or platform) has raised a wide range of challenging questions for the legal community. Many issues related to data scraping are playing out in the courts, yet uncertainty remains around whether and how data scraping should fall under the Computer Fraud and Abuse Act (CFAA).
High-profile examples of issues related to data scraping include Clearview AI, a maker of facial recognition software that scrapes images from a variety of sources to help law enforcement and other customers identify individuals. Another example was Facebook’s dismantling of the NYU Ad Observatory, which enabled the scraping of data from Facebook for academic research.
Such incidents have drawn the attention of governments and regulators, who in turn have to consider whether data scraping is in the public interest, and how to define instances of “data breaches” and “data leaks.” When is data private, and when is it public? What is the state of play at the borders of computer misuse and data scraping? What does past and current litigation tell us about the future of data scraping law and practice?
These questions were at the heart of a virtual “fireside chat” convened by the Center for Long-Term Cybersecurity on February 3, as two scholars from the UC Berkeley School of Law — Orin Kerr, Professor of Law, and Tejas Narechania, Assistant Professor of Law — addressed a range of questions related to data scraping in the context of the law. The event was the first of a series of public conversations hosted by CLTC that will explore the legal, technical, and policy questions and solutions that emerge when data is scraped at scale.
Following are excerpts from the conversation, edited for length and content. The full discussion is available above or on YouTube.
Narechania: [In the legal case hiQ Labs, Inc. v. LinkedIn Corporation], we see how data scraping blurs the lines between what is public and what is private, or what feels public and private. It’s public in the sense that we can access it online. But it’s private in the sense that it’s private to LinkedIn, at least in aggregate. But data scraping makes it possible for someone else to use that public interface to access the massive aggregation of data that feels more private.
Kerr: I think of it less as a question of public and private than as a question of what is thought to be hidden and what is exposed. The Computer Fraud and Abuse Act (CFAA) prohibits unauthorized access to a computer, but it was drafted in the 1980s, before you had everyone surfing the web all day and before you had companies and people putting private information up on the web. Courts have struggled with it, because you can look at it two different ways. You can say, the nature of a website is that it is inherently a kind of publication mechanism. It is all about making information available to everyone, without any authentication or restriction on access. In that case, it seems odd to talk about unauthorized access. On the other hand, you can say, people may put information on a website but not expect individuals to access it. They may know it’s exposed, but as a practical matter, it’s hidden away. What do you do with that? Is that an unauthorized access? That’s really what the courts have struggled with.
Narechania: We often have laws or precedents or statutes that are in place, and then have to respond to evolving technological regimes or new societal systems. Is there anything that is unique about this situation that makes that familiar problem more complicated than usual? And how should we think about the CFAA as this old law is applied to a new technological regime?
Kerr: In the early 1990s, no one knew what the Computer Fraud and Abuse Act meant, before the web came around. It’s common to say, here’s a body of law that’s well established. Here’s a new technology, and how does the old law apply to the new technology? But this is a legal framework and legal statutory language that was really uncertain. When courts started to play around with this around 2001 and 2002, you had incredibly broad interpretations of what the “unauthorized access” prohibition means, because Congress, in its infinite wisdom, added a civil remedy to the statute, so businesses started to use this to get into federal court. They’d say, “you visited our website, we didn’t want you to visit our website, we want an injunction.” Although the judges didn’t realize it at the time, they were saying that was a crime.
There are different concepts of what “unauthorized access” might mean. It might mean breaking through a code-based barrier, like an authentication gate. I think everybody realizes that’s unauthorized access. That’s classic hacking, in the vernacular. But then there are questions like, what if a person accesses a computer in a way contrary to some express limitation on access? You have very different theories of what authorization might mean. For a long time, there were no cases establishing what the answer is.
Finally, last June, the Supreme Court decided Van Buren v. United States, the Supreme Court’s first Computer Fraud and Abuse Act case, which lays down some markers as to how the CFAA might apply (or not apply) to web scraping. Van Buren was a government employee who was told he could access a sensitive database only for work-related reasons, but then he did so for personal reasons. He was paid a bribe to look up sensitive information on somebody, but it was part of an undercover FBI sting. And the Supreme Court had to figure out, is that unauthorized access because he violated the workplace rule? And the Supreme Court said, that is not unauthorized access, because he had a username and password to access this database. So therefore, the CFAA is sort of an on-off switch: either you have authorization or you don’t, and this express limitation couldn’t change that once he had been granted authorization. The court didn’t answer the big picture question of whether the CFAA is only about this authentication, but it leaves hints that that’s probably where the law should go.
Narechania: It sounds like there is persistent ambiguity as to unauthorized access when it comes to data scraping. If I go to a website, is it okay for me to make a bunch of HTTP calls and collect a bunch of data? Or is that outside the realm of what we think of as a normal website operation? Do you think that data scraping falls within this normal bucket? Is there some other question of authorization that will address the scraping question, such as the terms of service?
Kerr: My best sense is, if we’re talking about scraping, and there’s been no cease and desist letter sent (so it’s just a website that has information and you want to gather the publicly available information), that’s generally going to be held to be legal. This was the gist of the hiQ Labs v. LinkedIn case, although that case was vacated, and we’re now waiting for the Ninth Circuit to consider whether the same result applies after Van Buren. I think it does, but we’ll just have to wait and see. Courts have generally not adopted the basic norms-based idea that you’re supposed to know you’re not supposed to do that.
There was a case in 2003 called EF Cultural Travel versus Zefer that involved scraping information off of a pricing website for travel. A competitor business knew enough about how the website worked to scrape the data off of it and undercut the pricing offered by the company that had the website. The First Circuit said, we’re not going to get into this kind of norms-based, reasonable expectations of a website concept. It’s just too hard to know how much is enough, especially for a criminal statute, because you could just impose a written restriction that would itself guide authorization. That’s the theory that Van Buren rejects. You have to figure out, is that First Circuit case still valid? We don’t really know. But my own sense is that, if it’s a publicly available website, and there’s no cease and desist letter, you’re probably in the clear from a Computer Fraud and Abuse Act perspective.
Narechania: How does a cease and desist letter change the outcome? And is that the right process, or is there a different or better way to do it?
Kerr: My own personal take is that cease and desist letters are completely irrelevant. All such a letter does is indicate that the computer owner does not want you to do something. I think the CFAA is about the technology of what the computer owner has allowed, not what the computer owner’s preferences are.
There was an initial case, Facebook versus Power Ventures, involving a company that had a website that aggregated social media sites. It would take information from sites like Twitter and Facebook and get the permission of users to take the information from those sites and pipe it to Power.com, so you’d see all your social media sites there. Facebook didn’t like this one bit, because of course they sell advertisements for visiting Facebook, and they don’t get that if you’re just having the information sent somewhere else. So Facebook sent a cease and desist letter saying, you can’t do this, this violates our terms of service. And that led to litigation in the Ninth Circuit about whether the Computer Fraud and Abuse Act allowed Power Ventures to nonetheless access Facebook with the permission of the users after receiving the cease and desist letter.
To make it extra complicated, in hiQ Labs, Inc. v. LinkedIn Corporation, there was also a cease and desist letter, but that case involved publicly available web pages, so you didn’t need a LinkedIn account to access them. In Power Ventures, the Ninth Circuit said that the cease and desist letter withdrew authorization, and that accessing the account after receiving the letter was a violation of the Computer Fraud and Abuse Act. And then in the second case, hiQ, the cease and desist letter didn’t matter because it was a publicly open website.