Grant / November 2024

An Interpretability Study of LLMs for Code Security

Large language models (LLMs) such as ChatGPT have substantially advanced automated coding, yet they often fail to generate secure code. Current approaches to improving code security, which rely primarily on fine-tuning, struggle with robustness and generalizability. This proposal explores LLM interpretability as a path to more secure code generation. By employing bottom-up techniques (e.g., sparse autoencoders) and top-down techniques (e.g., representation engineering), we aim to understand how LLMs internally represent code properties and security across tasks and vulnerability types. We will also study training dynamics, using model checkpoints and smaller LLMs to assess how these representations develop during pretraining and fine-tuning. Building on these insights, we propose monitoring and control mechanisms that detect security issues, intervene, and guide code generation in real time. Techniques such as representation engineering and representation intervention will enable precise manipulation of the generation process. We also plan to refine fine-tuning methods, emphasizing control over internal features to improve security more comprehensively. This work seeks to establish a robust framework for secure and reliable code generation with LLMs.
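
To make the top-down direction concrete, the sketch below illustrates one possible form of representation-engineering-style steering at generation time. It is a minimal illustration, not the proposal's method: it assumes GPT-2 as a stand-in for a code LLM, a hand-picked layer, toy contrastive prompts, and a simple difference-of-activations "security" direction; a real study would derive directions more carefully and evaluate them on vulnerability benchmarks.

```python
# Illustrative sketch of top-down representation steering for secure code
# generation. Assumptions (not from the proposal): GPT-2 as a stand-in model,
# a hand-picked layer, toy contrastive prompts, and a fixed steering strength.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; a real study would use a code LLM
LAYER = 6             # placeholder block index to read from and steer at
ALPHA = 4.0           # steering strength (hyperparameter)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def layer_activation(text: str) -> torch.Tensor:
    """Mean hidden state after block LAYER for a prompt (a crude concept reading)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[LAYER + 1] is the output of transformer block LAYER
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Contrastive prompts: the difference of their activations gives a candidate
# "security" direction in representation space.
secure_prompt = "# Query the database using parameterized SQL statements\n"
insecure_prompt = "# Query the database by concatenating user input into SQL\n"
security_direction = layer_activation(secure_prompt) - layer_activation(insecure_prompt)
security_direction = security_direction / security_direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 holds the hidden states.
    hidden = output[0] + ALPHA * security_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "def fetch_user(db, username):\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=60, do_sample=False,
                               pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```

The same hook point could serve the monitoring side of the proposal: instead of adding the direction, one could project each step's hidden states onto it and flag generations whose projection drifts toward the insecure end, intervening only when a threshold is crossed.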