Reliability in Copilot Research: Measures for Consistent and Reproducible Results

4 Mar 2024

Authors:

(1) Xiyu Zhou, School of Computer Science, Wuhan University, Wuhan, China;

(2) Peng Liang, School of Computer Science, Wuhan University, Wuhan, China;

(3) Zengyang Li, School of Computer Science, Central China Normal University, Wuhan, China;

(4) Aakash Ahmad, School of Computing and Communications, Lancaster University Leipzig, Leipzig, Germany;

(5) Mojtaba Shahin, School of Computing Technologies, RMIT University, Melbourne, Australia;

(6) Muhammad Waseem, Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland.

V. THREATS TO VALIDITY

Construct validity refers to the extent to which a research tool or method accurately assesses the variable or concept being measured, and whether the results obtained are consistent with the RQs. It is a critical aspect of any study that involves measuring variables or concepts. Because data labelling, data extraction, and data analysis in this study were conducted manually, there is a risk of introducing personal bias. We therefore implemented several strategies to enhance construct validity. Before each formal step of the research process, we conducted pilot experiments to test the validity of the methods and to align the standards applied by different researchers. After each step of the pilot experiments and the formal research procedures, the first and third authors discussed the results to evaluate them and ensure consistency with the RQs. If any disagreements arose, the second author was consulted to help reach a consensus and further improve the validity of the study.

External validity refers to the generalizability of the findings of this study. For our research, the primary threat to external validity is the selection of data sources. To maximize external validity, we chose GitHub Issues, GitHub Discussions, and Stack Overflow (SO) posts as data sources. GitHub Issues is a tool for reporting and tracking software issues, allowing users to report errors, request features, and raise questions to developers. GitHub Discussions is a newer GitHub feature that aims to provide a more open and organized platform for users to communicate and share insights with other community members. As a popular Q&A community, Stack Overflow is also a platform where many developers discuss and share insights about Copilot usage. These platforms contain a diverse and substantial amount of relevant data, and their data are complementary. Consequently, we were able to collect diverse usage-related data on Copilot from a large number of developers and projects across these three data sources. However, despite these efforts, we acknowledge that there may still be relevant data that we missed.

Reliability refers to the degree to which a research method produces consistent and reproducible results. To minimize potential uncertainties arising from the research methodology, we implemented multiple measures to maximize the reliability of our study. We conducted pilot labelling to assess the consistency between the two authors prior to the formal data labelling process. The Cohen’s Kappa coefficients of the three pilot labelling processes are 0.824, 0.834, and 0.806, indicating good agreement between the authors. Throughout the data labelling, extraction, and analysis process, we thoroughly discussed and resolved any inconsistencies within the team to ensure the consistency and accuracy of the results. Additionally, we have made the dataset of the study available [13] to enable other researchers to validate our findings.
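
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen's Kappa can be computed for two raters. It is an illustration only, not the authors' tooling: the function name `cohen_kappa`, the label values, and the example data are all hypothetical and are not drawn from the study's dataset [13].

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, multiply the two raters' marginal
    # proportions, then sum over categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    # Degenerate case: both raters used one identical label for every item.
    if p_e == 1.0:
        return 1.0

    return (p_o - p_e) / (1 - p_e)


# Hypothetical pilot labels from two raters (not data from the study).
rater_1 = ["bug", "bug", "feature", "feature", "question",
           "bug", "feature", "question", "bug", "question"]
rater_2 = ["bug", "bug", "feature", "question", "question",
           "bug", "feature", "question", "feature", "question"]

kappa = cohen_kappa(rater_1, rater_2)
print(f"Cohen's Kappa: {kappa:.3f}")
```

Unlike raw percentage agreement, Kappa subtracts the agreement two raters would reach by chance given how often each uses every label, which is why it is commonly preferred for reporting labelling consistency.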

This paper is available on arXiv under a CC 4.0 license.