Beyond Personalization: How LLMs are Revolutionizing Recommendation Systems


Recommendation Systems (RecSys) have become an integral part of our digital lives. These intelligent systems are the invisible hand guiding us through a sea of choices. There are countless examples; to mention just a few:

  • E-Commerce: Personalized product suggestions enhance shopping experiences. Think of Amazon or eBay the next time you shop online.
  • Social Networks: Recommended advertisements, friends, or content shape our interaction patterns on social media.

This makes RecSys one of the most commercially successful AI applications, enhancing user experiences and driving business growth across industries.

Meanwhile, large language models (LLMs) have exploded in recent years. Fueled by their rise, the recommendation landscape has also shifted in this direction. These AI powerhouses, with their vast knowledge and impressive language-processing capabilities, are poised to unlock a new era of personalization and predictive power in RecSys.

The evolution of the LLMs [6].


But First, What Are Recommendation Systems?

Every recommendation system operates on a set of users, a set of items, and the interactions between them. Thus, we have three sets:

  • U: the set of users, each uniquely identified by an ID
  • I: the set of items, each uniquely identified by an ID
  • R ⊆ U × I: the set of user-item interactions, represented as a matrix of size |U| × |I|

Simply put, the purpose of a recommendation system is to recommend the best possible set of items to a given user based on the interaction history.

As noted above, the interaction history is modeled using an interaction matrix R, where the element at position [i, j] represents the interaction record between user i and item j.
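
To make this concrete, here is a minimal sketch of such an interaction matrix using SciPy sparse matrices (the user/item indices and ratings are toy values invented for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy example: 3 users, 4 items, explicit ratings on a 1-5 scale.
user_idx = np.array([0, 0, 1, 2, 2])
item_idx = np.array([0, 3, 1, 0, 2])
ratings = np.array([5, 3, 4, 2, 1])

# R[i, j] holds user i's rating of item j; missing entries mean "no interaction yet".
R = csr_matrix((ratings, (user_idx, item_idx)), shape=(3, 4))
print(R.toarray())
# [[5 0 0 3]
#  [0 4 0 0]
#  [2 0 1 0]]
```

A sparse representation matters in practice: real interaction matrices span millions of users and items, with only a tiny fraction of the entries observed.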

The interaction record, or so-called feedback, can be explicit or implicit. Below is an example of explicit feedback: movie ratings.

Example of explicit rating for movies.


Both types of feedback have different properties. The table below compares them:

Property / Feedback Type | Explicit                     | Implicit
Accuracy                 | High                         | Low
Abundance                | Low                          | High
Preference Types         | Positive, Negative, Neutral  | Positive only; absence is ambiguous
Example                  | Movies rated; items liked    | Items viewed; items purchased

From Shallow Models to the Generative Frontier

The current landscape of recommender systems can be split into traditional systems, which model the user-item interaction directly, and generative systems, which exploit the generative power of LLMs.

Shallow Models

Traditionally, recommendation systems relied on shallow models, utilizing matrix factorization techniques such as LU, QR, or singular value (SV) decomposition to analyze user-item interactions; hence the name shallow for this family of models.

These models aim to decompose the sparse interaction matrix X into the product of two dense matrices, W and H. The row vectors of W represent users, while those of H represent items. This is nicely illustrated below:

Illustration of the matrix factorization concept.
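
To make the idea tangible, here is a minimal stochastic gradient descent sketch of matrix factorization (a toy implementation under simplifying assumptions, not a production factorizer; the hyperparameters are arbitrary):

```python
import numpy as np

def factorize(X, k=2, steps=500, lr=0.01, reg=0.02, seed=0):
    """Factorize X (with 0 = missing) into W (users x k) and H (items x k)
    by SGD on the squared error over the observed entries only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = X.shape
    W = rng.normal(scale=0.1, size=(n_users, k))
    H = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(steps):
        for i, j in np.argwhere(X > 0):
            err = X[i, j] - W[i] @ H[j]             # error on a known rating
            W[i] += lr * (err * H[j] - reg * W[i])  # gradient step with L2 regularization
            H[j] += lr * (err * W[i] - reg * H[j])
    return W, H

X = np.array([[5, 0, 0, 3],
              [0, 4, 0, 0],
              [2, 0, 1, 0]], dtype=float)
W, H = factorize(X)
print(np.round(W @ H.T, 1))  # dense reconstruction; former zeros are now predictions
```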


These methods, while effective, often fall short in capturing the nuances of user preferences and item characteristics.

Deep Learning-Based Models

The advent of deep learning ushered in a new wave of RecSys, leveraging the power of deep neural networks to process complex data and provide more sophisticated recommendations. Unlike shallow models, they incorporate additional features about users and items, and they model the probability of a user interacting with an item. The image below depicts the architecture of the RecSys known as Wide & Deep.

The architecture of the Wide & Deep RecSys.
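
As a concrete illustration, here is a minimal PyTorch sketch of the Wide & Deep idea (the layer sizes and feature split are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Wide linear part (memorization over sparse cross features) plus a
    deep MLP over user/item embeddings (generalization)."""
    def __init__(self, n_wide_features, n_users, n_items, emb_dim=16):
        super().__init__()
        self.wide = nn.Linear(n_wide_features, 1)
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.deep = nn.Sequential(
            nn.Linear(2 * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, wide_x, user_ids, item_ids):
        deep_x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        logits = self.wide(wide_x) + self.deep(deep_x)
        return torch.sigmoid(logits).squeeze(-1)  # P(user interacts with item)

model = WideAndDeep(n_wide_features=10, n_users=100, n_items=50)
p = model(torch.randn(4, 10), torch.tensor([0, 1, 2, 3]), torch.tensor([5, 6, 7, 8]))
print(p.shape)  # torch.Size([4])
```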


Generative Recommendations

The most recent trend of making LLMs more powerful gave rise to generative recommendations. Unlike their predecessors, generative RecSys possess the remarkable ability to move beyond simple pattern recognition and handle multiple recommendation tasks at the same time.

Traditional recommendation systems rely on a multi-stage ranking process, handling vast numbers of items and users, each with learned embeddings. Generative RecSys, however, leverage LLMs to directly generate recommendations or related content, circumventing the need for computationally expensive ranking. This difference between the traditional and generative pipelines is depicted below:

Traditional vs Generative RecSys Pipeline [4].


How to Use LLMs for RecSys

The new paradigm of using LLMs for RecSys promises to streamline recommendation processes significantly. As shown in the diagram below, LLMs can be integrated into the recommendation pipeline in various ways, including:

  • Scoring/Ranking: Serving as the core model that predicts user preferences and ranks items accordingly.
  • Feature Extractor: Extracting meaningful features from user data in order to enrich user profiles.
  • Feature Encoder: Encoding user interactions into a format suitable for recommendations.
  • User Behavior Modeling: Understanding and predicting user behavior patterns to anticipate future needs and desires.
Where and how LLMs can be used in RecSys [5].


This blog post explores how LLMs can be employed as scoring models, a concept closely aligned with the fundamental goal of recommender systems: ranking items and users.

LLMs as Scoring Models

A key advantage of LLMs as scoring models is their versatility across multiple recommendation tasks. This eliminates the need for task-specific models. Some common tasks include:

  • Rating Prediction: Estimating a user’s item rating.
  • Top-K Recommendation: Selecting the top K items to recommend.
  • Sequential Recommendation: Predicting the next item a user is likely to interact with.
  • Explanation Generation: Providing users with transparent recommendation rationales.
  • Review Summarization: Summarizing item reviews into concise snippets.

This blog post focuses primarily on Rating Prediction and Top-K Recommendation. The figure below demonstrates some of the recommendation tasks:

Example of the different recommendation tasks [3].


LLMs offer various deployment strategies for these tasks:

  • Pre-training: Models are trained on massive datasets of text and code, learning general language understanding and reasoning skills.
  • Fine-tuning: Pre-trained LLMs are fine-tuned on specific recommendation datasets to adapt their knowledge to the nuances of user-item interactions.
  • Prompting: LLMs can be guided to perform specific recommendation tasks through carefully crafted prompts, leveraging their existing knowledge without the need for further training.

Pre-Training RecSys LLMs

In pre-training, techniques like Masked Behavior Prediction (MBP) and Next K Behavior Prediction (NBP) model user behaviors effectively, providing a robust foundation for fine-tuning tailored to specific recommendation tasks. One line of research in this area is the work presented in “PTUM: Pre-training User Model from Unlabeled User Behaviors”. The pre-training technique is summarized in the figure below:

How to pretrain LLMs for RecSys [3].
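
To give a feel for what MBP training data looks like, here is a small sketch of my own (a simplification, not the PTUM code): it turns a user's behavior sequence into masked-prediction examples by hiding one behavior at a time:

```python
MASK = "[MASK]"

def mbp_examples(behaviors):
    """Masked Behavior Prediction: for each position, mask the behavior
    and keep it as the prediction target."""
    examples = []
    for i, target in enumerate(behaviors):
        masked = behaviors[:i] + [MASK] + behaviors[i + 1:]
        examples.append((masked, target))
    return examples

# Toy behavior sequence (item IDs invented for illustration).
for inp, tgt in mbp_examples(["item_12", "item_7", "item_91"]):
    print(inp, "->", tgt)
# ['[MASK]', 'item_7', 'item_91'] -> item_12
# ['item_12', '[MASK]', 'item_91'] -> item_7
# ['item_12', 'item_7', '[MASK]'] -> item_91
```

NBP is analogous, except the model predicts the next K behaviors given the prefix of the sequence.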


Fine-Tuning RecSys LLMs

Once the LLMs are pre-trained (whether for recommendation tasks or not), we can fine-tune them on specific downstream tasks or leverage prompting techniques to utilize their broader capabilities.

The fine-tuning can be performed in two different ways: full fine-tuning or parameter-efficient fine-tuning (PEFT). For instance, the work presented in “UniTRec: A Unified Text-to-Text Transformer and Joint Contrastive Learning Framework for Text-based Recommendation” fully fine-tunes the BART model using a contrastive loss.

On the other hand, the research done in “TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation” successfully employs the LoRA adapter with the LLaMA-7B LLM.

How to finetune LLMs for RecSys [3].
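
For a sense of how lightweight PEFT is in practice, here is a minimal sketch of attaching a LoRA adapter with the Hugging Face peft library (this is not the TALLRec code; the checkpoint name and hyperparameters are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "huggyllama/llama-7b"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the weights is trainable
```

The base model stays frozen; only the small adapter matrices are updated during fine-tuning, which is what makes the approach tractable on a 7B-parameter model.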


Prompting RecSys LLMs

The third technique is prompting: the LLM is kept frozen (i.e., no parameter updates) and adapted to downstream tasks via task-specific prompts. There are three representative methods that use prompting to adapt LLMs as RecSys, as summarized in the figure below:

  • In-Context Learning: elicits the in-context ability of LLMs to adapt to downstream tasks from examples given directly in the prompt.
  • Prompt Tuning: adds new prompt tokens to the LLM and optimizes them on a task-specific dataset.
  • Instruction Tuning: fine-tunes the LLM to follow prompts as task instructions, rather than to solve one specific downstream task.
LLM prompt-tuning techniques for RecSys [3].
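
As a concrete example of in-context learning, a prompt for sequential recommendation could look like this (the demonstrations and movie titles are invented for illustration):

```python
# In-context learning: the frozen LLM infers the task from demonstrations
# embedded in the prompt; no parameters are updated.
prompt = """You are a movie recommender. Predict the next movie a user will watch.

User A watched: Alien, Aliens, Blade Runner. Next: The Thing.
User B watched: Toy Story, Shrek, Up. Next: Finding Nemo.

User C watched: Heat, Casino, Goodfellas. Next:"""
print(prompt)  # send this to any frozen LLM and read the completion as the recommendation
```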


A Diverse Landscape: LLMs as Recommendation Engines

The versatility of LLMs has led to a surge of research exploring their use as scoring functions in recommendation systems. The following taxonomy, visualized in the bubble chart below, captures the diverse landscape of these approaches:

Taxonomy of LLMs used as scoring functions [5].


The extra dimension is whether a conventional recommendation model (CRM) is used during the inference phase. Our focus here is on pure generative inference models, i.e., those that do not use any CRM at inference time. Two noteworthy examples include:

  • P5: A versatile, instruction-based pre-training model demonstrating impressive performance across various recommendation tasks.
  • ChatGPT: Researchers have explored the use of ChatGPT in a zero-shot configuration, showcasing the flexibility and potential of LLMs in recommendation scenarios.


Multi-Task Instruction Pretraining: The P5 Model

The P5 model performing multi-task recommendation [1].


The P5 model epitomizes the potential of LLMs in RecSys through multi-task instruction pretraining using the T5 LLM as a backbone. This approach allows the model to handle multiple recommendation tasks such as:

  • Rating Prediction
  • Direct Recommendation
  • Sequential Recommendation
  • Explanation Generation, and
  • Review Summarization

The model is based on instruction prompt tuning, like the FLAN model. It defines a set of personalized prompts for different users and items. Let’s dive into the details of the P5 model.

Model Architecture

The P5 model employs an encoder-decoder Transformer architecture with the SentencePiece tokenizer and is trained by optimizing the negative log-likelihood.

It uses standard token and position embeddings, with additional whole-word embeddings to manage sub-word tokens efficiently. For example, the SentencePiece tokenizer splits the word item_7391 into four tokens (item, _, 73, 91), but the shared whole-word embedding lets the model treat them as one entity.

The P5 model architecture [1].


Data

The model is trained on the following Amazon review datasets:

  • Sports & Outdoors
  • Beauty
  • Toys & Games

However, these datasets are not in the format the model needs. As the P5 model is based on instruction prompt tuning, we must transform the raw data into instructions, i.e., input-target pairs. For this reason, the model defines a set of instruction prompts for each task. For instance, here are some of the prompts used in the rating prediction task:

Prompt ID 1-1:

Input template: Which star rating will user_{user_id} give item_{item_id}? (1 being lowest and 5 being highest)

Target template: {star_rating}

Prompt ID 1-2:

Input template: Does user_{user_id} like or dislike item_{item_id}?

Target template: {answer_choices[label]}, where the answer is “like” for ratings of 4-5 and “dislike” for ratings of 1-3

Using multiple prompt templates for a single task increases language style variation and enables zero-shot evaluation.

These templates are used to format raw data into input-target pairs for model training. The last prompt in each task is reserved specifically for testing the model’s zero-shot capabilities. The figure below illustrates this input-target pair generation process for the rating prediction task:

The process of generating input-target pairs.
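
A minimal sketch of this pair-generation step, using prompt 1-1 from above (the raw record is a toy value invented for illustration):

```python
# Turn a raw rating record into an instruction-style input-target pair.
input_template = (
    "Which star rating will user_{user_id} give item_{item_id}? "
    "(1 being lowest and 5 being highest)"
)

record = {"user_id": 23, "item_id": 7391, "star_rating": 4}  # toy raw record

input_text = input_template.format(**record)  # unused keys are ignored by str.format
target_text = str(record["star_rating"])
print(input_text)   # Which star rating will user_23 give item_7391? (1 being lowest ...)
print(target_text)  # 4
```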


Training

With all the input-target pairs generated, the P5 model is trained with the instruction tuning method, optimizing the negative log-likelihood.

As mentioned, the P5 model utilizes pretrained T5 checkpoints as its backbone. Depending on the size of the base T5 model, there are two P5 variants:

  • P5-small (P5-S): 512 dimensions with 8-headed attention layers. It has 60M parameters in total.
  • P5-base (P5-B): 768 dimensions with 12-headed attention layers. It has 223M parameters in total.

Both models have a maximum input length of 512 tokens. They are trained for 10 epochs using the AdamW optimizer with a peak learning rate of 1 × 10⁻³ and a warmup stage set to 5% of all iterations.
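
Since P5 builds on T5, a single training step looks roughly like the sketch below, using the public t5-small checkpoint and Hugging Face transformers (an approximation of the setup, not the authors' code; it omits P5's whole-word embeddings):

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# One instruction-tuning step on a single input-target pair.
inputs = tokenizer(
    "Which star rating will user_23 give item_7391? "
    "(1 being lowest and 5 being highest)",
    return_tensors="pt",
)
labels = tokenizer("4", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # negative log-likelihood
loss.backward()
optimizer.step()
optimizer.zero_grad()
```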

Evaluation

To gauge performance, various metrics are used for each of the recommendation tasks:

  • Rating Prediction: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
  • Sequential & Direct Recommendation: Normalized Discounted Cumulative Gain (nDCG@k) and Hit Ratio (HR@k), sketched in code below.
  • Explanation & Review Summarization: BLEU-4, ROUGE-1, ROUGE-2, and ROUGE-L.
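
For reference, the two ranking metrics can be computed as follows (a self-contained sketch with binary relevance over a toy ranked list):

```python
import numpy as np

def hit_ratio_at_k(ranked_items, relevant_items, k):
    """HR@k: 1 if any relevant item appears in the top k, else 0."""
    return int(any(item in relevant_items for item in ranked_items[:k]))

def ndcg_at_k(ranked_items, relevant_items, k):
    """nDCG@k: discounted gain of relevant items, normalized by the ideal ranking."""
    dcg = sum(
        1.0 / np.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["item_3", "item_1", "item_9", "item_4"]  # model's ranking (toy)
relevant = {"item_9"}                              # ground truth (toy)
print(hit_ratio_at_k(ranked, relevant, k=3))       # 1
print(round(ndcg_at_k(ranked, relevant, k=3), 3))  # 0.5
```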

Let’s look at the performance in the Rating Prediction and Direct Recommendation tasks.

For the Rating Prediction task, the baselines are Matrix Factorization (MF) and a Wide & Deep-style multilayer perceptron (MLP), both trained under a mean squared error loss. For the Direct Recommendation task, the baselines are Bayesian Personalized Ranking (BPR) with Matrix Factorization (BPR-MF), BPR with a neural network (BPR-MLP), and SimpleX. Let’s see the reported results and draw some conclusions.

P5 performance on the rating prediction task [1].


P5 performance on the direct recommendation task [1].


For both tasks, the model’s evaluation uses seen and unseen data. Purple entries represent evaluation on seen data, where the test data includes entries from prompts used in training data generation. Orange entries indicate evaluation on unseen data, using test data generated from prompts never used for training.

The results show that P5 significantly outperforms BPR-MF and BPR-MLP. P5 also demonstrates strong performance on top-1 metrics compared to SimpleX, which suggests that utilizing multi-task LLMs like P5 in recommendation systems is feasible and can surpass traditional models.

ChatGPT as RecSys

In a parallel work inspired by the P5 model, a group of authors used ChatGPT as a general-purpose RecSys in a few-shot or zero-shot setting.

This approach explores whether ChatGPT’s extensive linguistic and world knowledge can be effectively transferred to recommendations. In the zero-shot scenario, the model receives no additional user data or interaction history. Conversely, the few-shot scenario enriches the prompt with the user’s interactions and potential interests, helping ChatGPT better understand user needs. This is illustrated in the image below.

Prompts to use ChatGPT as a RecSys in zero- and few-shot scenarios [2].
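
In code, the few-shot setup boils down to a single chat completion call (a sketch assuming the current OpenAI Python SDK; the prompt wording and model name are illustrative, not the paper's exact templates):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Zero-shot: only the task description, no user history.
zero_shot_prompt = (
    "You are a movie recommender. Recommend 5 movies for a new user, "
    "as a numbered list of titles only."
)

# Few-shot: the same task enriched with the user's interaction history.
few_shot_prompt = (
    "You are a movie recommender. The user recently watched and liked: "
    "Alien, Blade Runner, The Thing. Recommend 5 movies they might enjoy, "
    "as a numbered list of titles only."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat-completions model works here
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)
```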


Model Architecture

The model architecture is very simple, as illustrated in the figure below: ChatGPT plays the central role of a recommender, and an additional Output Refinement Module ensures that all outputs follow the same structure.

ChatGPT as RecSys Scoring Function [2].
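
The refinement step can be as simple as a parser that coerces free-form model output into a fixed structure; here is an illustrative sketch (my own simplification, not the paper's implementation):

```python
import re

def refine_output(raw_text, k=5):
    """Extract up to k recommended titles from free-form LLM output,
    tolerating numbered lists, bullets, and stray punctuation."""
    items = []
    for line in raw_text.splitlines():
        line = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if line:
            items.append(line)
    return items[:k]

raw = "1. Moon\n2) Sunshine\n- Arrival"
print(refine_output(raw))  # ['Moon', 'Sunshine', 'Arrival']
```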


Experimental Setup and Evaluation

The experimental setup mirrors that of P5 (detailed above). Notably, human evaluation is introduced for Explanation Generation and Review Summarization tasks.

ChatGPT performance on the Rating Prediction task [2].


ChatGPT performance on the Direct Recommendation task [2].


ChatGPT’s performance on Rating Prediction and Direct Recommendation tasks falls short of baseline models. This suggests that using ChatGPT in this manner for these specific tasks may not be optimal.

However, it’s interesting to note the subjective evaluation of ChatGPT on the Explanation Generation and Review Summarization tasks.

ChatGPT subjective evaluation on the Explanation Generation task [2].


ChatGPT subjective evaluation on the Review Summarization task [2].


Although ChatGPT underperforms baseline models on Rating Prediction and Direct Recommendation, its outputs are judged clearer and more reasonable compared to P5 and ground-truth data. This suggests that leveraging more powerful LLMs could lead to better user adoption of recommendation system outputs.

Conclusion

In this blog post, we took a brisk tour through the RecSys landscape. We began with shallow models and deep learning-based approaches, then shifted to the emerging paradigm of generative recommendations using LLMs. Along the way, we covered various techniques to adapt LLMs for recommendations, highlighting two notable models: P5 and ChatGPT.

Evaluating P5 against traditional RecSys models reveals that P5 can outperform them, all while offering multi-tasking capabilities.

On the other hand, using LLMs like ChatGPT in a zero-shot setting is still not well-suited for direct recommendations. However, human judges clearly preferred their recommendation summarization and explanation outputs over those of the competitors.

Thus, we can conclude that integrating LLMs into recommendation systems marks a significant advancement, ushering in an era of generative recommendations that promise enhanced efficiency and sophistication. As these models are further explored and refined, the potential for personalized, context-aware recommendations will expand, transforming user experiences across the digital realm.

For more information, please follow me on LinkedIn. If you like this content, you can subscribe to the mailing list below to get similar updates from time to time.


Acknowledgements

I would like to thank my colleagues at Frontiers, Tommaso Caneva and Sofya Lipnitskaya, for reviewing the content of this blog post.

References

  1. Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)
  2. Is ChatGPT a Good Recommender? A Preliminary Study
  3. Recommender Systems in the Era of Large Language Models (LLMs)
  4. Large Language Models for Generative Recommendation: A Survey and Visionary Discussions
  5. How Can Recommender Systems Benefit from Large Language Models: A Survey
  6. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
