This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Mohammed Latif Siddiq, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame;
(2) Joanna C. S. Santos, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame.
Table of Links
- Abstract & Introduction
- Background and Motivation
- Our Framework: SALLM
- Experiments
- Results
- Limitations and Threats to the Validity
- Related Work
- Conclusion & References
4 Experiments
This section describes the research questions we address in our experiments (§ 4.1) as well as the methodology to answer each of these questions (§ 4.2–4.4).
4.1 Research Questions
We aim to answer the following questions:
RQ1 How does SALLM compare to existing datasets?
First, we demonstrate the value of our manually curated dataset of prompts by comparing it to two existing datasets: LLMSecEval [67] and SecurityEval [63]. The goal is to contrast the coverage of vulnerability types (CWEs) and dataset size.
RQ2 How do LLMs perform with security-centric prompts compared to the evaluation setting used in the original studies?
As explained in Section 2.3, LLMs are evaluated with respect to their ability to generate functional code (not necessarily secure code). Thus, in this question, we evaluate the models' performance on the datasets they were originally evaluated with and compare it to their performance on our dataset.
RQ3 How can we use SALLM’s assessment techniques to prevent vulnerable generated code from being integrated into the code base?
This research question explores how our assessment techniques can detect vulnerable model-generated code before it is integrated into the code base. To answer this question, we obtain a dataset [71] of code snippets generated by ChatGPT that were publicly shared in GitHub commits or in source code comments.
4.2 RQ1 Methodology
To answer our first research question, we compare SALLM's dataset to two prior datasets of prompts used to evaluate the security of LLM-generated code:
• SecurityEval dataset [63]: It is a prompt-based dataset of 121 Python prompts collected from diverse sources, covering 69 CWEs, including MITRE's Top 25 CWEs. The prompts are signatures of Python functions along with their docstrings and import statements (an illustrative prompt in this style is sketched after this list).
• LLMSecEval dataset [67]: It is a natural language (NL) prompt-to-code dataset derived from Pearce et al. [51]. This dataset covers MITRE's Top 25 CWEs and contains 150 NL prompts to benchmark code generation models.
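To make the format of these prompts concrete, the snippet below is a hypothetical example in the SecurityEval style described above, i.e., import statements followed by a function signature and a docstring that the model is expected to complete; it is not taken from either dataset.

```python
# Hypothetical prompt in the SecurityEval style (not from either dataset):
# imports, a function signature, and a docstring describing the intended
# behavior; the model is asked to generate the function body.
import sqlite3


def get_user(db_path: str, username: str):
    """
    Return the row for the given username from the 'users' table of the
    SQLite database located at db_path.
    """
```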
We compare these datasets according to two dimensions: (I) number of supported vulnerability types (CWEs); and (II) dataset size (number of prompts).
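As a minimal sketch of this comparison, assuming each dataset is summarized in a JSON file mapping prompt identifiers to CWE identifiers (the file names and record structure below are hypothetical), the two dimensions can be computed as follows:

```python
# Minimal sketch: compare prompt datasets by CWE coverage and size.
# The JSON file names and their record structure are hypothetical assumptions.
import json


def summarize(name: str, path: str) -> None:
    with open(path) as f:
        records = json.load(f)  # e.g., [{"id": "...", "cwe": "CWE-89"}, ...]
    cwes = {record["cwe"] for record in records}
    print(f"{name}: {len(records)} prompts covering {len(cwes)} CWEs")


for name, path in [("SALLM", "sallm.json"),
                   ("SecurityEval", "securityeval.json"),
                   ("LLMSecEval", "llmseceval.json")]:
    summarize(name, path)
```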
4.3 RQ2 Methodology
We investigate in RQ2 the performance of existing LLMs when evaluated using SALLM, our framework. To answer this question, we provide each of the 100 prompts in our dataset as input to five models from three LLM families:
• CODEGEN [47] is an LLM for code generation trained on three large code datasets. This model has three variants: CODEGEN-NL, CODEGEN-MULTI, and CODEGEN-MONO. CODEGEN-NL is trained on the Pile dataset [17] and is focused on text generation. CODEGEN-MULTI is built on top of CODEGEN-NL and further trained on a large-scale dataset of code snippets in six languages (i.e., C, C++, Go, Java, JavaScript, and Python) [27]. CODEGEN-MONO is built from CODEGEN-MULTI and further trained on a dataset [47] of only Python code snippets. The authors also released CODEGEN2.5 [46], which is trained on the StarCoder data from BigCode [31] and has mono and multi versions. Since the mono variants are focused on Python-only generation, we use CODEGEN-2B-MONO and CODEGEN-2.5-7B-MONO to generate Python code.
• STARCODER [35] is an LLM with 15.5B parameters trained on code from over 80 programming languages. This model supports fill-in-the-middle objectives and can complete code given a code-based prompt.
• The GENERATIVE PRE-TRAINED TRANSFORMER (GPT) [8] is a family of transformer-based [68], task-agnostic models capable of both understanding and generating natural language. We used the latest OpenAI GPT models, i.e., GPT-3.5-TURBO and GPT-4, which are tuned for chat-style conversation and power a popular chat-based question-answering tool, ChatGPT [2], and its paid variant (ChatGPT Plus).
For each model, we generate 10 code solutions per prompt with a limit of 256 new tokens, varying the temperature from 0 to 1 in increments of 0.2 (i.e., 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0). We selected 256 as the token limit because the insecure code in our dataset has an average of 54 tokens and a maximum of 245 tokens; thus, a 256-token limit is sufficient for the models. For the GPT models, however, we doubled this value (i.e., 512) because they can generate an explanation along with the code, which consumes tokens.
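A minimal sketch of this sampling setup for a single prompt, assuming the Hugging Face transformers library and the Salesforce/codegen-2B-mono checkpoint (the prompt file name below is a hypothetical placeholder), could look as follows:

```python
# Sketch of the sampling setup: up to 10 completions per prompt at each
# temperature in {0.0, 0.2, ..., 1.0}, limited to 256 new tokens
# (the paper doubles this limit to 512 for the GPT models).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-2B-mono"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = open("prompt_example.py").read()  # hypothetical prompt file
inputs = tokenizer(prompt, return_tensors="pt")

completions = {}
for temperature in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    if temperature == 0.0:
        # Sampling is deterministic at temperature 0, so a single greedy
        # completion stands in for the 10 samples drawn at higher temperatures.
        sampling = dict(do_sample=False, num_return_sequences=1)
    else:
        sampling = dict(do_sample=True, temperature=temperature,
                        num_return_sequences=10)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        pad_token_id=tokenizer.eos_token_id,
        **sampling,
    )
    completions[temperature] = [
        tokenizer.decode(output, skip_special_tokens=True) for output in outputs
    ]
```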
After obtaining the generated code solutions from each model, we measure and contrast the performance of these models with respect to three metrics: pass@k [10], vulnerable@k, and secure@k (the last two are our novel metrics, as defined in Section 3.3.2). In our experiments, we choose k equal to 1, 3, and 5 because our goal is to evaluate these models in typical use scenarios, where developers will likely inspect only the first few code snippets generated by a model.
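For reference, the snippet below sketches the unbiased pass@k estimator from Chen et al. [10]; the same combinatorial form can, in principle, be reused with c counting secure (or vulnerable) completions to approximate secure@k and vulnerable@k, although the authoritative definitions of those two metrics are the ones given in Section 3.3.2.

```python
# Unbiased pass@k estimator from Chen et al. [10]. Reusing it with
# c = number of secure (or vulnerable) samples is only a sketch of
# secure@k / vulnerable@k; see Section 3.3.2 for the exact definitions.
from math import comb


def estimate_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn (without replacement)
    from n generations has the property exhibited by c of them."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 generations for a prompt, 4 of which pass the functional tests.
print(estimate_at_k(n=10, c=4, k=1))   # 0.4
print(estimate_at_k(n=10, c=4, k=5))   # ~0.976
```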
4.4 RQ3 Methodology
In this research question, we investigate to what extent the assessment techniques in SALLM can help engineers identify generated code that contains vulnerabilities. To answer this RQ, we collect code snippets generated by ChatGPT from the DevGPT dataset [71]. This dataset contains over 17,000 prompts written by engineers that were publicly shared on GitHub or HackerNews.
This dataset was constructed by finding ChatGPT sharing links (i.e., URLs in the format https://chat.openai.com/share/) from these different sources. The search was performed in July and August 2023. Once the web crawler identifies a ChatGPT sharing link, it extracts the code generated by ChatGPT and the corresponding prompt the developer used to generate it.
After collecting these sharing links, we analyzed their metadata to identify which links point to prompts requesting the generation of Python code. As a result, we obtained a total of 437 Python code samples generated by ChatGPT. For each of these 437 samples, we performed a filtering step in which we disregarded samples with compilation errors. We found 14 samples that were not compilable and excluded them, obtaining a total of 423 Python code samples.
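This compilability filter can be sketched with Python's built-in compile function; the snippets/ directory layout below is a hypothetical assumption about how the extracted samples are stored.

```python
# Sketch of the filtering step: keep only samples that compile.
# The snippets/ directory layout is a hypothetical assumption.
from pathlib import Path

kept, dropped = [], []
for path in sorted(Path("snippets").glob("*.py")):
    source = path.read_text(encoding="utf-8")
    try:
        compile(source, str(path), "exec")  # syntax check only; nothing is executed
        kept.append(path)
    except SyntaxError:
        dropped.append(path)

print(f"kept {len(kept)} samples, dropped {len(dropped)} non-compilable samples")
```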
After extracting these Python code snippets generated by ChatGPT, we run our static analyzer-based assessment technique on each of them. In our study, we investigate to what extent our techniques can identify which code snippets are vulnerable and which are not.
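As an illustration of this kind of per-snippet check (not necessarily SALLM's own analysis backend), a scan with the open-source Bandit analyzer could be driven as follows; the directory layout is again hypothetical.

```python
# Illustrative only: flag snippets for which a Bandit scan reports findings.
# Any static analyzer with a JSON-emitting CLI could be substituted here.
import json
import subprocess
from pathlib import Path

for path in sorted(Path("snippets").glob("*.py")):
    proc = subprocess.run(
        ["bandit", "-f", "json", "-q", str(path)],
        capture_output=True, text=True,
    )
    findings = json.loads(proc.stdout).get("results", [])
    verdict = "potentially vulnerable" if findings else "no finding"
    print(f"{path.name}: {verdict} ({len(findings)} issue(s))")
```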