Authors:
(1) Xueguang Ma, David R. Cheriton School of Computer Science, University of Waterloo;
(2) Liang Wang, Microsoft Research;
(3) Nan Yang, Microsoft Research;
(4) Furu Wei, Microsoft Research;
(5) Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo.
Abstract
The effectiveness of multi-stage text retrieval has been solidly demonstrated since before the era of pre-trained language models. However, most existing studies utilize models that predate recent advances in large language models (LLMs). This study seeks to explore potential improvements that state-of-the-art LLMs can bring. We conduct a comprehensive study, fine-tuning the latest LLaMA model both as a dense retriever (RepLLaMA) and as a pointwise reranker (RankLLaMA) for both passage retrieval and document retrieval using the MS MARCO datasets. Our findings demonstrate that the effectiveness of large language models indeed surpasses that of smaller models. Additionally, since LLMs can inherently handle longer contexts, they can represent entire documents holistically, obviating the need for traditional segmenting and pooling strategies. Furthermore, evaluations on BEIR demonstrate that our RepLLaMA–RankLLaMA pipeline exhibits strong zero-shot effectiveness. Model checkpoints from this study are available on HuggingFace [1].
1 Introduction
Text retrieval, which entails identifying and ranking the most relevant documents or text snippets in response to a query, is crucial in various open-domain language comprehension tasks (Petroni et al., 2021), including web search (Bajaj et al., 2016), open-domain question answering (Chen et al., 2017), and fact verification (Thorne et al., 2018). Retrieval also plays an important role in enhancing the effectiveness of large language models (LLMs) in a retrieval-augmented generation (RAG) pipeline (Lewis et al., 2020b; Shi et al., 2023). This approach not only mitigates hallucinations but also enables LLMs to access knowledge that is not captured within their parameters (Yang et al., 2023; Jiang et al., 2023).
A typical multi-stage text retrieval pipeline consists of a retriever, designed to efficiently locate the top-k relevant texts from a corpus, and a reranker, which further refines the order of the retrieved candidates to improve output quality (Nogueira and Cho, 2019). Both retrievers and rerankers have significantly benefited from the advent of pre-trained language models based on Transformers (Vaswani et al., 2017) such as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020). These models are trained to encode queries and documents into vector representations for retrieval (Karpukhin et al., 2020; Lin, 2021) or to directly score the relevance between a query and a document for reranking (Nogueira et al., 2019; Zhuang et al., 2023).
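The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the models studied in this paper: the corpus, the bag-of-words encoder `toy_encode`, and the term-overlap scorer `toy_pair_score` are hypothetical stand-ins for a learned dense retriever and a learned pointwise reranker.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy corpus (an assumption for illustration; not from the paper).
DOCS = [
    "llama is a large language model",
    "dense retrieval encodes queries and documents as vectors",
    "a reranker scores each query and document pair",
    "bananas are rich in potassium",
]
VOCAB = sorted({tok for doc in DOCS for tok in doc.split()})


def toy_encode(text: str) -> List[float]:
    """Stand-in encoder: term counts over VOCAB (a real retriever is learned)."""
    tokens = text.split()
    return [float(tokens.count(term)) for term in VOCAB]


def retrieve(query_vec: Sequence[float],
             doc_vecs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Stage 1: score every document by dot product and keep the top-k ids."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in pointwise scorer: fraction of query terms found in the document."""
    q_terms = set(query.split())
    return len(q_terms & set(doc.split())) / len(q_terms) if q_terms else 0.0


def rerank(query: str, candidate_ids: Sequence[int], docs: Sequence[str],
           score_fn: Callable[[str, str], float]) -> List[Tuple[int, float]]:
    """Stage 2: rescore only the retrieved candidates, one (query, doc) pair each."""
    scored = [(i, score_fn(query, docs[i])) for i in candidate_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Document vectors can be computed once, offline; only the query is encoded at search time.
doc_vecs = [toy_encode(d) for d in DOCS]
query = "dense retrieval vectors"
candidates = retrieve(toy_encode(query), doc_vecs, k=2)
ranked = rerank(query, candidates, DOCS, toy_pair_score)
print(ranked)
```

The design point the sketch tries to convey is the division of labor: the retriever touches the whole corpus but only through cheap vector comparisons, while the more expensive pairwise scorer sees just the k retrieved candidates.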
Recent large language models with billions of parameters, fine-tuned to follow instructions, such as InstructGPT (Ouyang et al., 2022), GPT-4 (OpenAI, 2023), and LLaMA (Touvron et al., 2023a,b), have exhibited extraordinary capabilities in many NLP tasks, surpassing previous smaller pre-trained language models (Zhao et al., 2023). For retrieval, recent methods such as LRL (Ma et al., 2023), RankGPT (Sun et al., 2023), and PRP (Qin et al., 2023) have explored prompting LLMs to perform zero-shot reranking using pairwise or listwise approaches. These methods leverage LLMs by viewing reranking as text generation.
However, we see a number of potential issues. First, these methods do not address the entire multi-stage pipeline, as it is challenging to cast retrieval from a large corpus as a text generation task. Second, they do not leverage labeled data when available. Finally, these rerankers are not efficient because they do not support parallel scoring and are slowed by their multi-pass decoding design.
Therefore, we argue that fine-tuning state-of-the-art large language models to function as retrievers and rerankers can yield better effectiveness than previous smaller models. This approach can also optimally utilize LLMs within multi-stage pipelines. Thus, we are motivated to investigate the following research question: How do state-of-the-art large language models perform when specifically fine-tuned for multi-stage text retrieval?
Our study aims to answer this question by conducting a comprehensive investigation into fine-tuning the latest LLaMA-2 model (Touvron et al., 2023b), a state-of-the-art, open-source large language model, as both a retriever and a reranker, which we refer to as RepLLaMA and RankLLaMA, respectively. Specifically, we utilize the MS MARCO (Bajaj et al., 2016) and BEIR (Thakur et al., 2021) datasets for our experiments. Our findings suggest that large language models surpass previous smaller models, achieving state-of-the-art effectiveness for both retrieval and reranking through a straightforward training regime and exhibiting strong zero-shot effectiveness. Furthermore, we observe that LLMs, which are inherently pre-trained on longer contexts, demonstrate potential in representing entire documents, thereby eliminating the need for traditional segmenting and pooling strategies for document retrieval.
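To make the contrast with segmenting and pooling concrete, the sketch below compares the two ways of scoring a long document. It is an illustration under stated assumptions only: the fixed-size, non-overlapping segmentation, the max-pooling rule, and the term-overlap scorer `toy_score` are hypothetical choices standing in for a learned model, and the example says nothing about the effectiveness of either approach in practice.

```python
from typing import Callable, List


def split_into_segments(text: str, size: int) -> List[str]:
    """Chunk a document into non-overlapping segments of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def toy_score(query: str, text: str) -> float:
    """Stand-in relevance scorer: number of distinct query terms present in the text."""
    return float(len(set(query.split()) & set(text.split())))


def segment_and_pool_score(query: str, doc: str, size: int,
                           score_fn: Callable[[str, str], float]) -> float:
    """Score each segment independently, then max-pool into one document score."""
    segments = split_into_segments(doc, size)
    return max((score_fn(query, seg) for seg in segments), default=0.0)


def whole_document_score(query: str, doc: str,
                         score_fn: Callable[[str, str], float]) -> float:
    """Score the document in a single pass, as a longer-context model allows."""
    return score_fn(query, doc)


# Hypothetical example: the two query terms fall into different segments.
doc = "dense retrieval maps text to vectors for search"
query = "dense vectors"
pooled = segment_and_pool_score(query, doc, size=4, score_fn=toy_score)
whole = whole_document_score(query, doc, score_fn=toy_score)
print(pooled, whole)
```

In this toy case no single segment contains both query terms, so the pooled score cannot reflect evidence that the whole-document score sees at once; a model that can encode the entire document avoids having to choose a segment size and pooling rule at all.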
This paper is available on arXiv under a CC 4.0 license.
[1] https://huggingface.co/castorini