This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Ghazaleh H. Torbati, Max Planck Institute for Informatics, Saarbrücken, Germany & [email protected];
(2) Andrew Yates, University of Amsterdam, Amsterdam, Netherlands & [email protected];
(3) Anna Tigunova, Max Planck Institute for Informatics, Saarbrücken, Germany & [email protected];
(4) Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany & [email protected].
Table of Links
- Abstract and Introduction
- Related Work
- Methodology
- Experimental Design
- Experimental Results
- Conclusion
- Ethics Statement and References
II. RELATED WORK
Exploiting Item Features: Content-based recommender systems incorporate item tags, item-item similarity, and user-side features. Item-item similarity is typically computed by mapping the tag clouds of items into a latent space and measuring distances between the embedding points. This can be combined with interaction-based methods that employ latent-space techniques, including deep learning (e.g., [2]–[8], [12]–[14]) or graph-based inference (e.g., [15], [16]). These methods excel in performance, but their experimental results often benefit from a large fraction of favorable test cases. For example, when the model is trained with books by some author, predicting that the user also likes other books by the same author is likely a (near-trivial) hit. In our experiments, we ensure that such cases are excluded.
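For illustration, the following is a minimal sketch of such tag-based item-item similarity, using TF-IDF vectors as a simple stand-in for any latent-space embedding of the tag clouds; the item tags are invented for this example and do not come from the cited works.

```python
# Minimal sketch: content-based item-item similarity over tag clouds.
# TF-IDF vectors stand in for any latent-space embedding of the tags.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

item_tags = {
    "book_1": "fantasy dragons epic quest",
    "book_2": "fantasy magic coming of age",
    "book_3": "biography history politics",
}

ids = list(item_tags)
vectors = TfidfVectorizer().fit_transform([item_tags[i] for i in ids])

# Pairwise cosine similarity between item embeddings; the nearest
# neighbors of an item the user liked become candidate recommendations.
sim = cosine_similarity(vectors)
anchor = ids.index("book_1")
for j in sim[anchor].argsort()[::-1]:
    print(ids[j], round(float(sim[anchor][j]), 3))
```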
Exploiting User Reviews: The most important user features are reviews of items, posted with binary likes or numeric ratings. Early works either mine sentiments on item aspects or map all textual information to latent features using topic models like LDA or static word embeddings like word2vec (see, e.g., [17]).
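As a rough illustration of this early, pre-transformer style, the sketch below represents review text as the average of static word2vec embeddings; the toy corpus and parameters are invented for this example and are not taken from the cited works.

```python
# Sketch of the early approach: a review (or a whole user profile) is
# represented as the mean of static word2vec vectors of its words.
import numpy as np
from gensim.models import Word2Vec

reviews = [
    "loved the characters and the pacing".split(),
    "the plot was slow but the writing was beautiful".split(),
]

# Train a tiny word2vec model on the review corpus (static embeddings).
w2v = Word2Vec(sentences=reviews, vector_size=50, min_count=1, epochs=20)

def embed(tokens):
    # Average the word vectors; out-of-vocabulary tokens are skipped.
    return np.mean([w2v.wv[t] for t in tokens if t in w2v.wv], axis=0)

print(embed(reviews[0]).shape)
```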
Recent works have shifted to deep neural networks with attention mechanisms [9]–[11], [18]–[20], or they feed review text into latent-factor models [21]–[23]. Some works augment collaborative filtering (CF) models with user text to mitigate data sparseness (e.g., [24]–[26]). However, pre-dating the advent of large language models, all these methods rely on static word-level encodings such as word2vec and are inherently limited.
As a salient representative, we include DeepCoNN [18] in the baselines of our experiments.
Transformer-based Inference: Recent works leverage pre-trained language models (LMs), mostly BERT, for recommender systems in different ways: i) encoding item-user CF signals with transformer-based embeddings, ii) making inferences from rich representations of the input review texts, or iii) implicitly incorporating the “world knowledge” that LMs hold in latent form.
An early representative of the first line is BERT4Rec [27], [28], which uses BERT to learn item representations for sequential predictions based on item titles and user-item interaction histories, but does not incorporate any text. The P5 method of [29] employs a suite of prompt templates for the T5 language model, in a multi-task learning framework covering direct as well as sequential recommendations along with generating textual explanations. We include a text-enriched variant of the P5 method in our experiments.
Advances in large language models have inspired approaches that leverage LLM “world knowledge”. Early works use smaller models like BERT to elicit knowledge about movie, music, and book genres [30]. Recent studies are based on prompting large autoregressive models, such as GPT or PaLM, to generate item rankings for user-specific recommendations [31], [32] or to predict user ratings [33], in a zero-shot or few-shot fashion, using in-context inference based solely on a user’s item titles and genres.
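For illustration, the sketch below shows how such a zero-shot ranking prompt could be assembled from a user’s item titles and genres; build_ranking_prompt and call_llm are hypothetical names, and the prompt wording is not taken from the cited works.

```python
# Illustrative sketch: zero-shot prompting of an instruction-following
# LLM with a user's liked titles/genres and a list of candidate items.
def build_ranking_prompt(liked, candidates):
    history = "\n".join(f"- {title} ({genre})" for title, genre in liked)
    options = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(candidates))
    return (
        "A user liked the following books:\n"
        f"{history}\n\n"
        "Rank the candidate books below from most to least likely to be "
        "liked by this user. Answer with the numbers only.\n"
        f"{options}"
    )

prompt = build_ranking_prompt(
    liked=[("The Hobbit", "fantasy"), ("Dune", "science fiction")],
    candidates=["The Name of the Wind", "War and Peace", "Neuromancer"],
)
# ranking = call_llm(prompt)  # hypothetical call to an LLM such as GPT or PaLM
print(prompt)
```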
Closest to our approach are the methods of [34], [35], which use BERT to create representations for user and item text, aggregated by averaging [34] or by k-means clustering [35]. The resulting latent vectors are used for predicting item scores. A major limitation is that the text encodings cover individual sentences only, which tends to lose signals from user reviews where cues span multiple sentences. Also, BERT itself is kept frozen, and the latent vectors for users and items are pre-computed without awareness of the prediction task. Our experiments include the method of [34], called BENEFICT, as a baseline.
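The following is a rough sketch of this general scheme, not the original BENEFICT implementation: review sentences are encoded independently (here with a sentence-transformers model as a stand-in encoder) and then aggregated by averaging or k-means clustering.

```python
# Rough sketch of sentence-level encoding with post-hoc aggregation.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in frozen encoder

user_sentences = [
    "The plot was gripping from the first chapter.",
    "I did not enjoy the romance subplot at all.",
    "Beautiful prose, but the ending felt rushed.",
]

# Each sentence is encoded in isolation, so cues spanning multiple
# sentences are not captured jointly.
embeddings = encoder.encode(user_sentences)

user_vector_avg = embeddings.mean(axis=0)  # averaging, as in [34]
user_centroids = KMeans(n_clusters=2, n_init=10).fit(embeddings).cluster_centers_  # k-means, as in [35]

# An item score can then be a similarity between the pre-computed user
# vector(s) and an item vector built the same way.
print(user_vector_avg.shape, user_centroids.shape)
```

Because each sentence is encoded in isolation and the encoder stays frozen, cross-sentence cues and task-specific signals are lost, which is exactly the limitation discussed above.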