#253 - DARPA’s AI Cyber Challenge Unveiled (with Andrew Carney)
===
G Mark Hardy: [00:00:00] Hey, if you've been to Black Hat or Defcon or something else like that, you might've seen the AI Cyber Challenge, but how would you like to talk to the person who's been running it? Stick around. You're gonna enjoy this.
G Mark Hardy: Hello, and welcome to another episode of CISO Tradecraft, the podcast that provides you with the information, knowledge, and wisdom to be a more effective cybersecurity leader.
My name is G Mark Hardy. I'm your host for today, and I have Andrew with me. Andrew Carney is the director of DARPA's AI Cyber Challenge. Welcome to the show.
Andrew Carney: Thank you so much for having me.
G Mark Hardy: First of all, it sounded like a really important thing to do, so tell me a little bit about your background. What led you to this point in your career and to taking on this role?
Andrew Carney: Yeah, so I've spent most of my career pretty steeped in the reverse engineering and vulnerability research disciplines. [00:01:00] I did my undergrad and master's in computer science, and I discovered that digging into the assembly and inner workings of compiled code was really exciting.
And then I discovered Capture the Flag, discovered competitive hacking. My day job for nearly 15 years was really focused on, like I said, vulnerability research, and in the evenings I would play CTF. All of that really understanding systems, software systems and hardware systems, at every possible level eventually led me to DARPA,
where I've spent seven-plus years, in one way or another, working on different cybersecurity programs.
G Mark Hardy: Wow, that's fascinating. So you're a real hands-on practitioner. A lot of times you think of someone from DARPA, yeah, they joined as a GS-1 doing the time tick for [00:02:00] WWV and then worked their way up through the pay scales. But you're a hands-on guy, which is awesome, and it sounds like you're just the right person to do that.
Now, for people who aren't familiar with the AI Cyber Challenge, also known as AIxCC, what is it? And why did DARPA decide that now makes a good time to do it? In fact, it's been going for a while, hasn't it?
Andrew Carney: We kicked off in July or August of 2023, and we just had our final event at Defcon this past August. So it's been quite a roller coaster two years. But I'm getting ahead of myself. AIxCC is a competition designed to challenge competitors to develop autonomous systems that can find and fix vulnerabilities in source code at speed and scale.
At the speed and scale that we need to secure the internet at large. That's really the focus.
G Mark Hardy: Wow. So what [00:03:00] you're really doing is focusing on using AI in a positive way instead of AI to hack. It's actually an anti-hack tool. It's designed to identify vulnerabilities, but instead of exploiting them, does it just identify them, wave a little flag and say, hey, over here?
Or does it go, hey, wait a minute, I fixed it for you, boss, we're all good? Or somewhere in between?
Andrew Carney: So the systems generate patches that have to pass the same tests that you would have to pass as an individual contributor to a real software project. So they're passing every unit test, and they're passing a set of private tests that we developed as competition organizers to ensure that the patches did not prevent the software from functioning as intended.
This idea of non-interference, right?
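The acceptance criteria Andrew describes, every public unit test plus a set of organizer-held non-interference tests, can be sketched roughly as follows. The callable-based harness and all names here are illustrative stand-ins, not AIxCC's actual competition infrastructure:

```python
# Minimal sketch of AIxCC-style patch acceptance, under assumed interfaces.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CandidatePatch:
    """An autonomously generated patch awaiting validation."""
    diff: str

def accept_patch(patch: CandidatePatch,
                 unit_tests: List[Callable[[CandidatePatch], bool]],
                 private_tests: List[Callable[[CandidatePatch], bool]]) -> bool:
    """A patch is accepted only if it passes every public unit test AND
    every organizer-held private test (the non-interference check)."""
    return (all(t(patch) for t in unit_tests)
            and all(t(patch) for t in private_tests))

# Illustrative stand-ins for real test suites:
patch = CandidatePatch(diff="--- a/parser.c\n+++ b/parser.c\n...")
unit = [lambda p: True, lambda p: True]       # the project's own tests
private = [lambda p: "parser" in p.diff]      # organizers' hidden checks
print(accept_patch(patch, unit, private))     # prints True: all gates pass
```

The key design point is that both test sets must pass; a patch that "fixes" the bug by breaking intended functionality fails the private suite and is rejected.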
G Mark Hardy: Yeah. Is the focus on anything in particular, like critical infrastructure? Because I saw a lot of critical infrastructure when I walked around that display at Defcon. Or is this for home users [00:04:00] of Windows 10, hoping not to have to upgrade next month?
Andrew Carney: Yeah, so we live in a world today where our critical infrastructure depends on open source software and volunteer contributors and maintainers, and at the same time is under attack all the time from nation-state level adversaries. We read in the news all the time about these sorts of attacks.
And there's an incongruity there, a mismatch between the attention and level of effort and our ability to secure that environment, and it's important to our daily lives. The idea with the competition was: can we develop tools that take existing program analysis techniques, software analysis techniques that have been developed for decades, and
pair them with this new-on-the-scene generative AI, LLM-flavored technology, and really see what they can [00:05:00] do to make a dent in this massive technical debt issue we have with relying on software that we can't secure fast enough. We get really excited about developing new features and putting more layers on our tech stacks.
That's how we progress, that's how we advance as a society. But at the same time, we need to make sure that foundation is rock solid. AIxCC in a lot of ways was focused on critical infrastructure, to solidify that foundation, to make it resilient and robust so that we could keep building new, exciting technologies on top of it.
G Mark Hardy: And that makes a really good place to start, because you said we all depend upon it. I know that I'm down here in the Tampa-St. Pete area. I think it was last year the Oldsmar water supply got tampered with by an external actor. It was detected fairly early, and so no serious damage was done. But the point is that if you look at a lot of this critical [00:06:00] infrastructure, it's not run centrally by the government.
We don't have a whole team of people at NSA all tasked with defending America's critical infrastructure. A lot of these guys are on their own. Talk to power companies and they might share one cybersecurity consultant among three or four of them. So at best you've got some fractional attention.
So as we look at the challenge going forward, I think being able to leverage AI is gonna be absolutely huge for organizations that either don't have the budget or the understanding to apply smart people to the problem. And quite honestly, getting the best people is a little bit hard sometimes when you're gonna say, hey, you can come work for the local water authority or something like that.
It just doesn't sound as sexy as doing something really cool. But when you did this competition, it was a two-year competition, and the prize money was non-trivial. I saw the first prize: Team Atlanta, a $4 million prize; Trail of Bits, $3 million; Theori, $1.5 million. How many teams started the challenge, and what made these teams exceptional relative to the [00:07:00] others?
Andrew Carney: A year and a half ago we had over 90 teams registered from around the world, all with an interest in taking on this problem. And then by the time we got to the beginning of our semifinal event a year ago, we had just over 40 teams. So there was some attrition. I think folks realized this was a lot harder.
I think in general the automated program repair, the patching component, is an extremely hard problem. It touches on some very core program synthesis challenges that are open, hard problems to solve in the research community. And we were asking teams to do that, and do that on real software.
So the challenge was extremely hard. Honestly, leading up to semifinals, we were really excited just to see proof of life. A year ago, we gave the teams five large open source projects, [00:08:00] and we created synthetic forks that had tens to hundreds of synthetic commits introducing novel vulnerabilities that no LLM had ever seen before, that no team had ever seen before.
Then we handed those repositories to the systems, with no human intervention, and they were able to find a respectable number of the synthetic vulnerabilities and patch also a respectable number. But I think the key moment there was that one of the teams found a zero-day. They found an issue in SQLite that we got to disclose.
And so then we were off to the races. Then it was: okay, we know that if we combine LLMs and program analysis techniques, we can find and patch real vulnerabilities. Now let's see how far we can push the teams while, at the same time, the industry is pushing the models. [00:09:00] And that led us to Defcon this past year, which was very exciting,
where the teams found nearly 80% of all the synthetic vulnerabilities we gave them and patched over 60% of them. And this is in 54 million lines of code, a massive set of 20-plus repositories from real open source projects. And then they found 18 zero-days, of which they patched 11. We're still going through the disclosure process on those, just to make sure that they're remediated properly.
So it's been a really exciting process, showing the security research community that we can combine these technologies in a much more effective way, but also that now all of the teams' tools are available, open source. They're available to the public at large to use.
We really [00:10:00] want this technology to be a rising tide that lifts all boats in the software security space.
G Mark Hardy: Now, do we concern ourselves, from DARPA's perspective, with the natural national security concerns about entities that are not friendly to the United States of America exploiting either the first half of this, to say, whoa, look what we found, or the second half, to say, let's keep these Americans out of our systems?
And yet you're an honest broker through this whole process, to say this is science, this isn't politics. Is that a correct evaluation?
Andrew Carney: 100%. We don't have anything to hide here. The program is wholly focused on defense. And I would say too that the program's bar for what constitutes a software defect or a vulnerability is, I think, reasonable enough that patching is warranted, but not so mature or exquisite that it's [00:11:00] super useful on the exploit side.
G Mark Hardy: We're still early on in a way, because you've just wrapped this up. My thought is that this has been going for two years and the world has been shifting massively. How closely did the final goalposts match what you advertised to the 90-plus teams who decided to take the field, and was it really part of the expectation that, yeah, this is gonna be a moving objective?
Andrew Carney: I think that's one of the big differences between this challenge and some of DARPA's other challenges. When DARPA runs a challenge, the goalposts tend to be set so far away from where we are currently that any progress is useful. In the first DARPA Grand Challenge with self-driving cars, none of the cars made it to the finish line.
But that was a super useful set of data points for identifying, okay, this is the edge of what we can do today. And then in the following event, [00:12:00] multiple cars finished. So it's, I think, a useful metric in general of what we're capable of when we really apply ourselves as a society, as a community of researchers.
G Mark Hardy: Intellectual pump priming, if you will. You're getting people to focus on something they might not otherwise do, but in the context of a challenge, in the context of something that's organized, and of course the esprit de corps that goes with going after a prize and eventually potentially even winning.
It will keep people motivated long into the night, the evening, the weekend, and all the extra time it takes to accomplish all that. So for chief information security officers, I think the first thought is: okay, can I just download this stuff and turn it on, and now I don't have to worry about my current team that's doing it?
I can deal with the big boss who wants me to cut headcount; they want AI everywhere. Or is this really just something that's gonna put smart [00:13:00] people on steroids, but you don't really want to turn over the car keys and say, yeah, have at it?
Andrew Carney: I see these sorts of technologies generally as force multipliers. When we're thinking about the folks that are attacking our networks, availability of exploitable vulnerabilities is not typically the limiting factor for them. But on the defense side, it is very challenging to do vulnerability management across all the different devices, across all the different code bases that we have to deal with.
So the idea is that AIxCC technology, which is available and can be used with some amount of scaffolding, does not displace anyone. It empowers them, because defense is still so incredibly challenging. I think that's the real opportunity here.
In the final competition, [00:14:00] just over a month ago, we demonstrated that you could reason over millions of lines of source code, find real defects, and then patch them for a few hundred dollars per issue. It's a very cost-effective approach. But those patches may still require review by humans, depending on the underlying code base and the types of defects and issues they're finding; that can be tunable based on your preferences. I'm also a program manager at ARPA-H, where I focus on healthcare cybersecurity, and ARPA-H was an active collaborator with DARPA on this challenge. In healthcare we have a lot of challenges related to complexity,
but our concerns are slightly different from some other security use cases. In healthcare, availability is king, if we're thinking about the CIA triad. In other domains, that's not the case. So failing open in healthcare may be preferable to failing closed, which [00:15:00] might be ideal in another use case.
So the idea that we can start reasoning over code and having solutions, having these patches, in a very timely, cost-effective way is really exciting from a vulnerability management perspective, because it means that, as customers, we can talk to our software vendors, talk to our supply chain, and encourage the use of this technology.
And DARPA is also interested in helping people apply this technology to their software, to their infrastructure. I'm happy to plug our availability for engagement on this front.
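The point above, that machine-generated patches may still need human review "tunable based on your preferences," can be sketched as a small policy gate. Every field, threshold, and component name here is a hypothetical illustration of how an operator might encode those preferences, not anything from the competition:

```python
# Hypothetical review-gate policy for machine-generated patches.
from dataclasses import dataclass

@dataclass
class GeneratedPatch:
    component: str       # e.g. "auth", "logging"
    severity: str        # "low" | "medium" | "high"
    tests_passed: bool   # did it pass the full test suite?

def review_decision(patch: GeneratedPatch,
                    auto_apply_max_severity: str = "low",
                    critical_components: frozenset = frozenset({"auth", "billing"})) -> str:
    """Return 'auto-apply', 'human-review', or 'reject' per operator policy."""
    order = {"low": 0, "medium": 1, "high": 2}
    if not patch.tests_passed:
        return "reject"                # never ship a patch that breaks tests
    if patch.component in critical_components:
        return "human-review"          # availability-critical code always gets eyes
    if order[patch.severity] <= order[auto_apply_max_severity]:
        return "auto-apply"
    return "human-review"

print(review_decision(GeneratedPatch("logging", "low", True)))   # auto-apply
print(review_decision(GeneratedPatch("auth", "low", True)))      # human-review
```

A healthcare operator, where availability is king, might shrink the auto-apply set to nothing; a less availability-sensitive shop might widen it. That is the tunability being described.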
G Mark Hardy: And there's no reason not to, so we're good. But an observation: this challenge was done in a controlled environment. Now when we say, hey, can we extend this? I can download the tools, and as a CISO I can tell my team, hey, dive in, it's Christmas early.
Go ahead and [00:16:00] grab some tools and have some fun with them. But do we think this is gonna work well in today's messy world of legacy systems? We've got third-party vendors who may or may not be patching; we may not even be able to control that. We've got stuff we haven't touched for a while, the technical debt that's been building up because we don't have the resources.
Is it gonna simply say, hey, it's like a kid in a candy store, and find problems all over the place? Or can we keep it focused in a way that lets us prioritize what we're doing to reflect our business desires, to say, yeah, you can look at a lot of stuff, but let's aim this cannon in this direction,
'cause that's gonna be the biggest benefit?
Andrew Carney: So it's absolutely tunable and focusable. I would also say that the success of the competition, and the tools being available, asks a question of us all. The reason that we have to prioritize, the reason that we triage, the reason that we manage our attack surface, is because we have limited resources with which to address a subset [00:17:00] of that attack surface at any given time.
And I think AIxCC's price point here points us to a world where maybe we can actually fix everything. And when I say fix everything, I mean that we can address all the low-hanging fruit and increasingly address issues further and further up the difficulty or complexity scale.
And then we can simplify the remaining vulnerability management tasks in a big way. We're not there yet, but also, this is the worst the technology will ever be. It's only going to get cheaper.
G Mark Hardy: It's gonna get cheaper, it's gonna get better, it's gonna get faster, and things like that. But of course, as I mentioned a little bit earlier, we still have the concern of adversary use of AI. Is DARPA seeing anything that suggests, if this were considered some sort of digital arms race, how we're doing?
Is this something where the [00:18:00] advantage tends to go to the offense, which is a little bit different than conventional warfare, where typically the advantage goes to the defenders? What are you seeing there? If we look at it from a game theory perspective, where is the advantage today?
And is it gonna stay that way?
Andrew Carney: So I think this type of technology, AIxCC's approach to addressing the issue, is an opportunity for defenders to really gain an edge. Even when we don't have source code, there are lots of program analysis techniques that let us reason over binaries, that let us reason over network traffic, that let us reason over software artifacts, data artifacts in general. And the idea that we can use different flavors of AI-powered systems to reason over those, while also using those more traditional program analysis techniques to [00:19:00] weed out hallucinations and prevent false positives, is really exciting. It's an interesting set of use cases that other domains, scientific domains especially, might struggle with, because they don't necessarily have that deterministic, ground-truth analysis piece. So I'm very excited.
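The pairing Andrew describes, AI proposals checked against deterministic ground truth, can be sketched in miniature. The toy target program and the "LLM proposals" list below are invented purely for illustration:

```python
# Sketch of hallucination filtering: LLM-proposed crash inputs are kept only
# if deterministic execution actually reproduces a crash.
from typing import Callable, Iterable, List

def target(data: bytes) -> None:
    """Toy program under test: crashes on a specific malformed header."""
    if data[:4] == b"EVIL":
        raise RuntimeError("heap overflow (simulated)")

def confirmed_findings(proposals: Iterable[bytes],
                       run: Callable[[bytes], None]) -> List[bytes]:
    """Keep only inputs that deterministically crash the target."""
    confirmed = []
    for candidate in proposals:
        try:
            run(candidate)               # ground truth: actually execute it
        except Exception:
            confirmed.append(candidate)  # crash reproduced, not a hallucination
    return confirmed

# An LLM might emit a mix of real and imagined crashers:
llm_proposals = [b"EVILxxxx", b"GOODdata", b"EVIL\x00\x00"]
print(confirmed_findings(llm_proposals, target))  # only the crashing inputs survive
```

The non-deterministic generator can hallucinate freely; the deterministic check means no false positive escapes. That ground-truth step is exactly what some other scientific domains lack.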
G Mark Hardy: Yeah, it does sound really exciting. So as we go forward, more organizations will say, hey, the AIxCC has produced something of great value, we're gonna adopt some of this, put it to work, and hey, it's actually doing good stuff. Now, as you said, it's always useful when you're dealing with AI to have a human in the loop.
But if somebody decides to just cross the streams and connect the wires there, and an AI-generated patch fails or even causes an outage, what are your thoughts on liability? Should chief information security officers think about governance or compliance expectations?
Where's the risk, if you look at it from the standpoint of the tools not [00:20:00] doing exactly what we hope they would do?
Andrew Carney: I think when we think about non-deterministic AI, we're talking about an unreliable source of information and task completion. The good news is that we have a lot of experience working with those. They're called humans. And arguably we can tune our use cases and workflows with AI-based systems in a much more transparent way than we can change the behavior of peers, colleagues, et cetera. And it can also be a collaborative engagement.
So there's the idea of having an AIxCC kind of system, a cyber reasoning system, a CRS, that's the term we've [00:21:00] been using for a while at DARPA. The CRS can be this sort of collective combination of knowledge, where multiple people are engaging in improving the more deterministic process management, strategy, and tactics components,
which keeps the less deterministic components as reliable as possible. I think that's super exciting. We saw something interesting during the competition. As I mentioned before, automated program repair is this hard program synthesis problem.
The idea that all of the work you do finding a vulnerability actually improves your ability to patch it in that moment is, I think, very powerful. By combining the two, there's a greater-than-the-sum-of-its-parts element here, where all the knowledge you have about the program, about the control flow, about the constraints, about the different data types, everything that led you to the vulnerability, can immediately inform
your patch generation, [00:22:00] which is not really how we do patch development in software development life cycles generally. There are multiple manual steps, and in the transmission of information we play a little bit of a game of telephone, right? We lose information throughout that process.
So I think that's another interesting space where the closed feedback loop on the automation side is potentially very powerful, and I think would be a net benefit in terms of reducing liability and risk.
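The closed loop being described, discovery context flowing straight into patch generation with no game of telephone, might look schematically like this. The `Finding` fields and the string "patch" output are purely illustrative, not how any competition CRS actually works:

```python
# Sketch: context gathered during vulnerability discovery feeds patch
# generation directly, instead of being lost across manual handoffs.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Finding:
    """Context accumulated while finding the vulnerability."""
    crash_function: str
    reproducer: bytes
    constraints: List[str] = field(default_factory=list)  # e.g. "len(buf) > 64"

def generate_patch(finding: Finding) -> str:
    """Guard the exact failing condition at the exact crash site, using the
    constraints discovery already derived, rather than re-deriving them
    from a written-up bug report."""
    guards = " && ".join(f"!({c})" for c in finding.constraints) or "true"
    return f"in {finding.crash_function}: reject input unless {guards}"

finding = Finding(crash_function="parse_header",
                  reproducer=b"EVIL" + b"A" * 100,
                  constraints=["len(buf) > 64"])
print(generate_patch(finding))
# prints: in parse_header: reject input unless !(len(buf) > 64)
```

The point of the sketch is the data flow: nothing learned by the finder (crash site, reproducer, constraints) is summarized and re-transmitted; the patcher consumes it verbatim.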
G Mark Hardy: Interesting, and that makes good sense. Now, one of the things we look at: we have software that we purchase or lease. I guess we never really own the software unless we write it ourselves, but there's a lot of open source that's out there. Is there somebody or some entity that would make the logical lead in applying these types of AIxCC tools to open source?
Because it has, of course, a very large benefit, but it's one of those things: you do it, why don't I do it? Let's get Mikey, he'll do [00:23:00] it.
Andrew Carney: I don't think there's any one organization necessarily. We've been working with the Linux Foundation and the Open Source Security Foundation and a few other sub-foundations there to ensure that we're able to connect with open source project maintainers effectively.
And so they've been incredibly helpful as partners throughout the last few years, and continue to be as we take the technology that's been open sourced and get it into these projects' dev pipelines. I will also say that if you're a standalone software developer or project maintainer, email us at [email protected], and we are actually happy to help provide support and resources,
especially if you work on a critical infrastructure-related project. We want to help folks see [00:24:00] this kind of work get used, especially if they're involved in critical infrastructure.
G Mark Hardy: Yeah, we'll go ahead and put that in the show notes, but [email protected] would be the way to get in touch with you. Awesome. What about if we're trying to convince leadership, boards of directors, executives about the importance of this? You're a technical expert, you've had a technical career, and so you get it.
But if we're trying to talk to the pointy-haired boss, if you will, in the corner office, what is it that translates into actionable decision criteria for them?
Andrew Carney: I think this is probably one of the most evidence-based actions that you could empower your CISO to take with your software supply chain. I'd say if you want your CISO to be a fall person and just fire them the next time you inevitably have some sort of incident, I guess that's a choice.
But if you want someone that's gonna proactively start securing your software supply chain and your [00:25:00] internal dev pipelines, these are exactly the types of tools that you would want to leverage. And like I said, the cost is a few hundred dollars per issue discovered, and they will only get cheaper.
I think there's a lot of evidence these systems produce on various aspects of software, from a performance perspective, a safety perspective, a security perspective, that are all byproducts of doing this sort of automated feedback-loop analysis.
So I think there's lots of opportunity here to reduce your overall liability, or at least be aware of it, be prepared, and not be surprised. Being surprised in the security space almost always feels bad.
G Mark Hardy: Awesome. Any final thoughts you wanna leave before we wrap up?
Andrew Carney: [00:26:00] Automated patch generation is...
G Mark Hardy: It's the future.
Andrew Carney: It's the future, and it's now. It's available, it's cheap, it's fast, and we will help you use it. That's the other thing: we, DARPA and ARPA-H, will help you use it, if you would like that help.
G Mark Hardy: Awesome. Andrew Carney from DARPA, thank you very much for talking to us about the AI Cyber Challenge. I appreciate it. For our listeners out there, if you're not already a subscriber to CISO Tradecraft, don't forget to subscribe, and tell your friends about it; we can help them in their careers as well. Follow us on LinkedIn.
We have a lot more than podcasts. We also have a Substack newsletter, and we have postings that are out there to help you in your cybersecurity career. If you liked this episode, give us a thumbs up, and go ahead and get in touch at [email protected] and follow through, 'cause you're gonna find out that these people have done some amazing things that will benefit you.
So thank you very much for your time and attention. This is your host, G Mark Hardy, and until next time, stay safe out there.