Grant / November 2024

An Interpretability Study of LLMs for Code Security

Large language models (LLMs) such as ChatGPT have substantially advanced automated coding, yet they often fail to generate secure code. Current approaches to improving code security, which rely primarily on fine-tuning, struggle with robustness and generalizability. This proposal explores LLM interpretability as a path to more secure code generation. By employing bottom-up techniques (e.g., sparse autoencoders) and top-down techniques (e.g., representation engineering), we aim to understand how LLMs internally represent code properties and security across tasks and vulnerability types. We will also study training dynamics, using model checkpoints and smaller LLMs to assess how these representations develop during pretraining and fine-tuning. Building on these insights, we propose monitoring and control mechanisms that detect security issues, intervene, and guide code generation in real time. Techniques such as representation engineering and representation intervention will enable precise manipulation of the generation process. We also plan to refine fine-tuning methods, emphasizing control over internal features to improve security more comprehensively. This work seeks to establish a robust framework for secure and reliable code generation with LLMs.
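
To make the top-down direction concrete, the sketch below illustrates one possible form of representation-engineering-style steering at generation time. It is a minimal illustration, not the proposal's method: it assumes GPT-2 as a stand-in for a code LLM, a hand-picked layer, toy contrastive prompts, and a simple difference-of-activations "security" direction; a real study would derive directions more carefully and evaluate them on vulnerability benchmarks.

```python
# Illustrative sketch of top-down representation steering for secure code
# generation. Assumptions (not from the proposal): GPT-2 as a stand-in model,
# a hand-picked layer, toy contrastive prompts, and a fixed steering strength.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; a real study would use a code LLM
LAYER = 6             # placeholder block index to read from and steer at
ALPHA = 4.0           # steering strength (hyperparameter)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def layer_activation(text: str) -> torch.Tensor:
    """Mean hidden state after block LAYER for a prompt (a crude concept reading)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[LAYER + 1] is the output of transformer block LAYER
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Contrastive prompts: the difference of their activations gives a candidate
# "security" direction in representation space.
secure_prompt = "# Query the database using parameterized SQL statements\n"
insecure_prompt = "# Query the database by concatenating user input into SQL\n"
security_direction = layer_activation(secure_prompt) - layer_activation(insecure_prompt)
security_direction = security_direction / security_direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 holds the hidden states.
    hidden = output[0] + ALPHA * security_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "def fetch_user(db, username):\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=60, do_sample=False,
                               pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```

The same hook point could serve the monitoring side of the proposal: instead of adding the direction, one could project each step's hidden states onto it and flag generations whose projection drifts toward the insecure end, intervening only when a threshold is crossed.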