This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Jakub DRÁPAL, Institute of State and Law of the Czech Academy of Sciences, Czechia, Institute of Criminal Law and Criminology, Leiden University, the Netherlands;
(2) Hannes WESTERMANN, Cyberjustice Laboratory, Université de Montréal, Canada;
(3) Jaromir SAVELKA, School of Computer Science, Carnegie Mellon University, USA.
5. Experimental Design
(RQ1) Autonomous Generation of Initial Codes The quality of the automatically generated initial codes was manually assessed by one of the authors (a subject matter expert). The analysis suggested that, while largely sensible, the codes overly focused on what was stolen, ignoring other aspects of the analysis (e.g., how the offense was committed).
To gauge the extent of the issue, we evaluated whether the codes address (even implicitly) how the theft happened and what was stolen (evaluation scheme shown in Figure 4).
Each initial code was first analyzed with respect to ¬How; if the issue was confirmed, the analysis stopped, i.e., the code was not further considered for ¬What.
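To make the cascading logic of the scheme concrete, the following is a minimal sketch in Python. The label names and the two predicate functions are assumptions for illustration only; in the paper the assessment was performed manually by a subject matter expert, not by keyword heuristics.

```python
def addresses_how(code: str) -> bool:
    # Stand-in for the expert's judgment of whether the code captures
    # how the theft happened (e.g., forceful entry, pickpocketing).
    return any(kw in code.lower() for kw in ("entry", "break", "pickpocket", "force"))

def addresses_what(code: str) -> bool:
    # Stand-in for the expert's judgment of whether the code captures
    # what was stolen (e.g., a vehicle, a bicycle, goods).
    return any(kw in code.lower() for kw in ("vehicle", "bicycle", "goods", "wallet"))

def evaluate_initial_code(code: str) -> str:
    """Cascaded evaluation: a code failing the 'how' criterion is labeled
    NOT_HOW and is not considered further for the 'what' criterion."""
    if not addresses_how(code):
        return "NOT_HOW"
    if not addresses_what(code):
        return "NOT_WHAT"
    return "OK"

print(evaluate_initial_code("vehicle theft with forceful entry"))  # OK
print(evaluate_initial_code("theft of a bicycle"))                 # NOT_HOW
```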
(RQ2) Generating Initial Codes with Expert Feedback Following the analysis of the autonomously generated initial codes, we formulated compact instructions for the system to mitigate the most commonly appearing issues.
The feedback consisted of two parts: (i) positive (what to focus on) – target, modus operandi, seriousness, and intent; and (ii) negative (what to avoid) – multiplicity of the offense, degree of completion, co-responsibility, value of stolen goods.
We also provided three examples of desirable initial codes, such as “vehicle theft with forceful entry and disassembly of vehicles”. The instructions were included in the prompt (user message) as custom requirements. With the updated prompt, we repeated the generation of the initial codes.
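As a rough illustration of how such feedback could be packaged into the user message as custom requirements, consider the sketch below. The surrounding prompt wording and function names are assumptions; only the focus/avoid lists and the example code come from the paper.

```python
CUSTOM_REQUIREMENTS = """\
Focus on: the target, the modus operandi, the seriousness, and the intent.
Avoid: multiplicity of the offense, degree of completion, co-responsibility,
and the value of the stolen goods.
Example of a desirable initial code:
"vehicle theft with forceful entry and disassembly of vehicles"
"""

def build_messages(facts: str) -> list[dict]:
    """Assemble chat messages for one factual description of a theft."""
    user_message = (
        f"{CUSTOM_REQUIREMENTS}\n"
        f"Factual description:\n{facts}\n\n"
        "Generate a short initial code for this description."
    )
    return [
        {"role": "system", "content": "You perform thematic analysis of criminal cases."},
        {"role": "user", "content": user_message},
    ]
```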
To assess the effects of the provided feedback, the newly generated codes were then manually coded using the same scheme as in the evaluation of RQ1, i.e., ¬How → ¬What → Ok.
(RQ3) Predicting Themes To evaluate the zero-shot performance of the LLM in predicting themes for the analyzed data points, we employed the theme prediction component of the pipeline to label each data point with one of the themes arrived at by human experts. Note that a factual description may contain multiple themes, e.g., bicycle theft and theft from an open-access place, whereas the experts were instructed to assign the most specific and salient one.
To account for this phenomenon, we also instructed the system to assign three of the themes to each data point. We then measured the performance of the system on this task in terms of recall at 1 (R@1) and recall at 3 (R@3).
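The R@k computation can be sketched as follows, assuming each prediction is an ordered list of themes (most confident first) and each data point carries a single gold theme assigned by the experts; the theme strings here are illustrative.

```python
def recall_at_k(gold: list[str], predicted: list[list[str]], k: int) -> float:
    """Fraction of data points whose gold theme appears in the top-k predictions."""
    hits = sum(1 for g, preds in zip(gold, predicted) if g in preds[:k])
    return hits / len(gold)

gold = ["bicycle theft", "vehicle theft"]
preds = [
    ["theft from an open-access place", "bicycle theft", "pickpocketing"],
    ["vehicle theft", "burglary", "shoplifting"],
]
print(recall_at_k(gold, preds, 1))  # 0.5  (only the second top-1 hit)
print(recall_at_k(gold, preds, 3))  # 1.0  (both gold themes in the top 3)
```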
(RQ4) Automatic Discovery and Prediction of Themes To investigate the performance of the proposed pipeline on the end-to-end task of autonomously discovering themes from the provided data, and assigning each analyzed data point with one of the identified themes, we employed the successive components of the pipeline to: (i) generate initial codes (with expert feedback); (ii) collate the initial codes into potential themes; (iii) group the potential themes into a compact list of higher-level themes; and (iv) assign each data point with one of the high-level themes.
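A high-level sketch of the orchestration of steps (i)–(iv) is shown below. The four stage functions are hypothetical stand-ins for the LLM-backed components the paper describes; they are stubbed with trivial logic here only so the end-to-end flow is runnable.

```python
def generate_initial_code(facts: str) -> str:
    return facts.split(".")[0].lower()   # stub: first sentence as a "code"

def collate_codes(codes: list[str]) -> list[str]:
    return sorted(set(codes))            # stub: deduplicate into potential themes

def group_themes(potential: list[str]) -> list[str]:
    return potential[:10]                # stub: compact list of higher-level themes

def assign_theme(facts: str, themes: list[str]) -> str:
    return next((t for t in themes if t in facts.lower()), themes[0])  # stub

def run_pipeline(data_points: list[str]) -> dict[str, str]:
    codes = [generate_initial_code(dp) for dp in data_points]    # (i)
    potential_themes = collate_codes(codes)                      # (ii)
    themes = group_themes(potential_themes)                      # (iii)
    return {dp: assign_theme(dp, themes) for dp in data_points}  # (iv)

points = ["Bicycle stolen from an open courtyard.",
          "Car broken into and stripped for parts."]
print(run_pipeline(points))
```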
We then compared the automatically assigned themes to the manual themes discovered and assigned by subject matter experts.