A $5,000 research award donated by UC Berkeley alum and cybersecurity expert Tim M. Mather ’81 funded a UC Berkeley student-led research team’s entry into the Artificial Intelligence Cyber Challenge (AIxCC). This national competition is sponsored by the Defense Advanced Research Projects Agency (DARPA), a research and development agency of the U.S. Department of Defense.
The goal of the team’s project, “A CodeLM Automated Repair Program with Analysis, Planning, and Control,” was to develop an automated, AI-based code repair solution that combines both vulnerability detection and patch generation. The team’s members included Samuel Berston, a student in the UC Berkeley School of Information’s Master of Information and Cybersecurity (MICS) program; Marlon Fu, a student in the Master of Information and Data Science (MIDS) program; Marsalis Gibson, a PhD student in the UC Berkeley Department of Electrical Engineering and Computer Sciences (EECS); and MICS students Katelynn Hernandez, Francisco Laplace, Gerald Musumba, Narayanan Potti, Ansuv Sikka, and Lawrence Wagner.
The team’s project was accepted into the competition, making them semi-finalists. While the group did not compete in person at DEF CON 2024 in Las Vegas, their participation had a significant impact on the entire competition: the team discovered a serious vulnerability in the files DARPA provided as part of the challenge and reported it to the organizers, leading DARPA to send a security patch to all participants.
We interviewed the team’s tech lead, Marsalis Gibson, to learn more about the team’s experience.
What was the core challenge you were trying to solve?
DARPA wanted us to build an AI-based solution that could take an open source project like Linux, a large project often used in critical infrastructure, and develop an automated repair program that could detect CVEs [i.e., Common Vulnerabilities and Exposures: known, documented, frequently exploited weaknesses in software]. The point was not just to detect vulnerabilities, but also to suggest a patch. They wanted us to leverage the power of AI to create a repair program.
What was your approach to the challenge?
We knew it would be difficult to develop a novel solution in the limited amount of time, so our approach had three parts. First, there was the detection part of the system. How do you detect vulnerabilities? Then there’s the patching part: even if you know what vulnerabilities there are, how do you produce a code patch? And then third, how do you understand and evaluate whether your patch and detection are correct or not? We had to come up with approaches for those three parts, which is why it was a really hard challenge.
For our approach to detection, we used an existing solution, CodeQL, which lets you produce a database representation of your code and then query it for vulnerabilities and errors. We wanted to see how well that performed on the challenge projects that DARPA provided, and then compare that with simply prompting a large language model like Google Gemini or ChatGPT. We ended up using the open source solution.
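To make that detection step concrete, here is a minimal sketch, not the team’s actual code, of driving the CodeQL command-line tool from Python: it builds a CodeQL database for a C/C++ project and runs GitHub’s standard C/C++ query pack against it. The source path, database name, and build command are placeholders.

```python
import subprocess

# Placeholder paths and build command; a real challenge project would differ.
SOURCE_ROOT = "challenge-project/"    # checkout of the target repository
DB_DIR = "challenge-project-db"       # where CodeQL stores its database
RESULTS = "codeql-results.sarif"      # SARIF report of detected issues

# 1. Build a CodeQL database: CodeQL watches the build to extract the code.
subprocess.run(
    ["codeql", "database", "create", DB_DIR,
     "--language=cpp", f"--source-root={SOURCE_ROOT}",
     "--command=make"],
    check=True,
)

# 2. Query the database with the standard C/C++ query pack.
subprocess.run(
    ["codeql", "database", "analyze", DB_DIR,
     "codeql/cpp-queries",
     "--format=sarif-latest", f"--output={RESULTS}"],
    check=True,
)

print(f"Findings written to {RESULTS}")
```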
For the patching part, we originally wanted to prompt an LLM with step-by-step instructions for how to patch, based on an instruction template keyed to the vulnerability type. What we actually did was build on someone else’s research from a few years ago, which used a prompt engineering approach with different LLMs. Those researchers had compared ChatGPT, Gemini, and a few others, and they had a method and a metric for judging the correctness of a patch: they could check whether the program compiled with the new patch, and if certain tests were breaking, that was a sign it might not be a good patch. With that, they could produce a bunch of patch candidates.
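That candidate-and-check loop can be sketched roughly as follows. The prompt template, the `llm` callable, and the `make`-based build and test commands are all stand-ins, assumed here for illustration rather than taken from the team’s or the cited researchers’ code.

```python
import subprocess
from textwrap import dedent

# Hypothetical instruction template keyed to the vulnerability type.
PROMPT_TEMPLATE = dedent("""\
    You are fixing a {vuln_type} vulnerability.
    File: {path}
    Vulnerable snippet:
    {snippet}
    Return only the corrected snippet, changing as little code as possible.
""")

def generate_patch_candidates(llm, vuln_type, path, snippet, n=5):
    """Ask the model for several candidate fixes; `llm` stands in for whatever
    chat-completion client is actually used (ChatGPT, Gemini, ...)."""
    prompt = PROMPT_TEMPLATE.format(vuln_type=vuln_type, path=path, snippet=snippet)
    return [llm(prompt) for _ in range(n)]

def candidate_is_plausible(repo_dir):
    """Cheap correctness filter: does the patched project still compile and
    pass its tests? The make targets are placeholders for the real harness."""
    build = subprocess.run(["make", "-C", repo_dir], capture_output=True)
    if build.returncode != 0:
        return False
    tests = subprocess.run(["make", "-C", repo_dir, "test"], capture_output=True)
    return tests.returncode == 0
```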
The challenge for us was that our solution had to detect and patch vulnerabilities across a whole repository of code. For example, the Linux kernel, which was the first challenge project released in the competition, has millions of lines of code and thousands of files. We needed to take that solution and figure out how to apply it to an entire repository. That’s where our “quick fix” novel solution came in, though it wasn’t going to be a quick fix.
The challenge was: how do we take the ability to patch individual files and turn it into a program that can detect vulnerabilities and suggest patches for whole repositories of code?
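Scaling that up is essentially a control loop over every detector finding in the repository. Here is a minimal sketch under that assumption, with the prompting and build/test steps abstracted behind stand-in callables rather than the team’s actual controller code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    """One detector hit, e.g. parsed out of a CodeQL SARIF report."""
    path: str
    vuln_type: str
    snippet: str

def repair_repository(findings, generate_candidates, try_candidate):
    """Walk every finding and keep the first candidate patch that survives the
    build-and-test check. `generate_candidates(finding)` and
    `try_candidate(finding, patch)` are stand-ins for the prompting and
    compile/test steps sketched earlier."""
    accepted = {}
    for finding in findings:
        for candidate in generate_candidates(finding):
            # try_candidate is expected to apply the patch, rebuild, run tests,
            # and revert on failure.
            if try_candidate(finding, candidate):
                accepted[finding.path] = candidate
                break
    return accepted
```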
How did it come out?
Overall, we weren’t able to create a fully workable system because of our time constraints; however, we were able to implement a system architecture (a controller-and-server design), test a potential detection solution, and test a very basic LLM prompt engineering approach to see how LLMs might perform as a patching solution. We also read a huge number of papers on existing solutions, talked to other faculty in the field, and spoke with security professionals.
Through this entire process, we ultimately learned that using machine learning (ML) for highly accurate and effective detection and patching of vulnerabilities is still a hard problem. There are major challenges to overcome before we have effective solutions, and I would recommend that, now and in the future, we continue to rely on the expertise of human security professionals, integrating humans and ML solutions together rather than treating ML as a replacement for that expertise.
While working on the project, your team ended up discovering a vulnerability in the DARPA files. What happened?
I initially used my lab machines to host our project, and a few days after I set it up, my lab mate got a notification from the Berkeley security team saying that our machine might be compromised. We worked with Kaylene, the person responsible for maintaining our lab system, to try to pinpoint what could be the problem. I told her that I had just installed these programs for this challenge.
With the help of one of Kaylene’s friends, we were able to locate the issue. We found that one of the repositories DARPA had sent out to all of the competitors was exposing unnecessary ports in its Docker containers. That means that anyone within the network could simply scan your machine and do whatever they wanted with it, because the program was listening on that port unnecessarily.
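For a sense of why an unnecessarily published port is so dangerous, here is a small illustration, with a placeholder address rather than anything from the actual challenge infrastructure, of how trivially someone on the same network can discover a listening service with a plain TCP connect scan.

```python
import socket

# Placeholder target; not the actual DARPA challenge setup.
TARGET_HOST = "192.168.1.50"

def find_open_ports(host, ports=range(1, 1025), timeout=0.5):
    """Return the ports on `host` that accept a TCP connection."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:  # 0 means the connection succeeded
                open_ports.append(port)
    return open_ports

if __name__ == "__main__":
    print(find_open_ports(TARGET_HOST))
```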
I ended up sending an email to DARPA, letting them know what happened. They did further investigation and pinpointed the exact lines where the issue occurred and what they needed to change.
We gave them a high-level overview of the issue, and relayed that this could actually be really dangerous for anybody who doesn’t have a firewall, or anybody who is running these on a public server. It could be a really critical vulnerability. Once they dug into it and found out how to fix it, they sent an announcement to all the competitors telling them to download the patch. We were the first team to identify and tell them about this issue.
What’s next for the project?
We left our code files in a state where someone can pick the project back up. If I have the opportunity, I would love to continue working on it. My primary research looks at the security of AI-integrated robotics, which has taken up most of my time. But if there were ever an opportunity and a reason to pick up this project again, I would love to do that.