Model inference with Hugging Face

Nowadays it is very common to run inference with models released on the Hugging Face Hub.

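import os

# Set these before importing transformers so that the library picks them up.
# The environment variable WRKDIR points to a personal work directory.
os.environ["HF_HOME"] = f"{os.environ['WRKDIR']}/huggingface"
# Expand "~" explicitly in case the path is used without further expansion.
os.environ["HF_TOKEN_PATH"] = os.path.expanduser("~/.cache/huggingface/token")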
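The snippet above raises a KeyError if WRKDIR is not set. A minimal sketch of a more defensive variant follows; the helper name configure_hf_home and the fallback to the home directory are assumptions made for illustration, not part of the original setup.

```python
import os
from pathlib import Path


def configure_hf_home(env=os.environ):
    """Point HF_HOME at <WRKDIR>/huggingface, falling back to the home directory.

    Hypothetical helper: the fallback location is an assumption made for
    illustration only.
    """
    base = env.get("WRKDIR") or str(Path.home())
    hf_home = os.path.join(base, "huggingface")
    env["HF_HOME"] = hf_home
    return hf_home


# Call this before importing transformers so the library sees the setting.
print(configure_hf_home())
```

Passing a plain dictionary instead of os.environ makes it possible to check the resulting path without touching the real environment.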

Hugging Face models

Hugging Face models are an abstraction that describes a pretrained model that can be used for different use cases.

Because different models have different structures they are all collected under AutoModel classes that abstract the model loading and usage.

AutoModels are further extended by specialized classes that define how the model behaves with different tasks. Each model has supported tasks enabled by extending classes such as AutoModelForCausalLM, AutoModelForQuestionAnswering or AutoModelForImageClassification.

You can specify the model yourself by using AutoModel.from_pretrained or the function with the same name from the subclass e.g. AutoModelForCausalLM.from_pretrained, but using pipeline is usually easier.
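As a rough sketch of what loading a model explicitly looks like, the function below pairs AutoModelForCausalLM with AutoTokenizer. The function name generate_text and its defaults are illustrative assumptions. The imports sit inside the function only so that defining it does not load or download anything.

```python
def generate_text(model_id, prompt, max_new_tokens=64):
    """Generate a continuation for `prompt` without using pipeline.

    Illustrative helper (not part of Transformers). Calling it downloads the
    model weights, and device_map="auto" assumes accelerate is installed.
    """
    # Imported lazily so that defining the function has no side effects.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # Encode the text into tensors and move them to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode the first (and only) generated sequence back into text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (commented out because it downloads a large model):
# print(generate_text("mistralai/Mistral-7B-Instruct-v0.3", "Hello"))
```

Every step here (tokenizing, generating, decoding) has to be written out by hand, which is the work that pipeline otherwise takes care of.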

Transformers pipeline

Different tasks are enabled on different models. However, all AutoModels support the Transformers pipeline API, which makes launching models relatively simple. You just need to provide the pipeline with the task name and the model name and it will handle the rest.

Pipeline itself has multiple subclasses like TextGenerationPipeline that define all steps needed to run the model from preprocessing to the result decoding.

The Pipeline class takes a huge number of different options that can be used to modify its behaviour. When the pipeline is created, many of these options are passed on to the various objects that the pipeline creates.

The returned Pipeline has all of these components baked into one object:

flowchart LR
    Pf[pipeline-function] --> P[Preprocessor chosen based on task]
    Pf --> T[Tokenizer chosen based on task]
    Pf --> A[AutoModel chosen based on task]
    Pf --> F[Framework chosen based on settings]
    P --> Pc[Pipeline-class]
    T --> Pc
    A --> Pc
    F --> Pc

When this class is called to do its task, it will run through the full pipeline:

flowchart TD
    Pi["pipeline(data)"] --> P["Preprocessor (e.g. AutoProcessor)"]
    P --> Te["Tokenizer encoding (e.g. AutoTokenizer)"]
    Te --> A["AutoModel (e.g. AutoModelForCausalLM)"]
    A --> M["Model (e.g. mistralai/Mistral-7B-Instruct-v0.3)"]
    M --> F["Framework (e.g. PyTorch)"]
    F --> O[Model output]
    O --> Td[Tokenizer decoding]
    Td --> Pd[Pipeline output]
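To make the stages in the flowchart concrete, here is a toy stand-in written in plain Python. It is not how Transformers implements Pipeline: the class name ToyTextPipeline, the four-word vocabulary and the "model" that simply picks the next vocabulary entry are all made up for illustration.

```python
class ToyTextPipeline:
    """Toy stand-in that mirrors the stages in the flowchart above.

    This is NOT how Transformers implements Pipeline; the vocabulary and the
    "model" are made up purely for illustration.
    """

    def __init__(self, vocab):
        self.token_to_id = {token: i for i, token in enumerate(vocab)}
        self.id_to_token = {i: token for token, i in self.token_to_id.items()}

    def preprocess(self, text):
        # Preprocessing and tokenizer encoding: text -> token ids.
        return [self.token_to_id[token] for token in text.lower().split()]

    def forward(self, token_ids):
        # "Model": append the id that follows the last one, wrapping around.
        next_id = (token_ids[-1] + 1) % len(self.token_to_id)
        return token_ids + [next_id]

    def postprocess(self, token_ids):
        # Tokenizer decoding: token ids -> pipeline output.
        text = " ".join(self.id_to_token[i] for i in token_ids)
        return [{"generated_text": text}]

    def __call__(self, text):
        return self.postprocess(self.forward(self.preprocess(text)))


toy_pipe = ToyTextPipeline(["spaghetti", "with", "bolognese", "sauce"])
print(toy_pipe("Spaghetti with"))
# prints [{'generated_text': 'spaghetti with bolognese'}]
```

The caller only sees text going in and a list of dictionaries coming out, which is the same shape of interface the real pipeline offers.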

In practice all of this is just a couple of lines of code:

from transformers import pipeline
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto"
)
Device set to use cuda:0

Calling the pipeline is simple as well:

pipe("In 200 words or less, how do you make spaghetti bolognaise?", max_new_tokens=512)
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
[{'generated_text': 'In 200 words or less, how do you make spaghetti bolognaise?\n\n1. Sauté onion, carrot, celery, and garlic in olive oil until soft.\n2. Add ground meat, cook until browned.\n3. Stir in tomato paste, canned tomatoes, beef broth, wine (optional), bay leaves, and thyme.\n4. Simmer for 30 minutes, stirring occasionally.\n5. Season with salt, pepper, and sugar to taste.\n6. Cook spaghetti according to package instructions, drain.\n7. Toss spaghetti with sauce, garnish with grated Parmesan cheese and fresh basil.'}]

We can silence the warning about pad_token_id by passing it explicitly. Here we reuse the tokenizer's end-of-sequence token as the padding token, which is what the pipeline otherwise does implicitly.

pipe("In 200 words or less, how do you make spaghetti bolognaise?", max_new_tokens=512, pad_token_id=pipe.tokenizer.eos_token_id)
[{'generated_text': 'In 200 words or less, how do you make spaghetti bolognaise?\n\nCook spaghetti according to package directions. For the bolognaise sauce, sauté onion, carrot, and celery in olive oil until tender. Add ground beef, cook until browned. Stir in tomato paste, crushed tomatoes, salt, pepper, and basil. Simmer for 20 minutes. Drain spaghetti and toss with sauce. Serve with grated Parmesan cheese. Enjoy!'}]

Handling system prompts when using LLMs

Setting a system prompt

LLMs that are fine-tuned to follow instructions use a system prompt that specifies how the model should behave. Typically the system prompt contains ground truths and instructions that the LLM is expected to know and follow.

If no system prompt is provided, the model may fall back to a default one (e.g. the default prompt used by some Llama-based models).

However, in many cases overriding the system prompt is the best way of controlling the LLM's behaviour, because the LLM has been trained to emphasize its contents.

Changing the system prompt is done by specifying a message chain where the first message is a system message.

messages = [
    {"role": "system", "content": "You're a three start Michelin chef answering questions from the public. Answer in 200 words or less."},
    {"role": "user", "content": "How do you make spaghetti bolognaise?"},
]

pipe(messages, max_new_tokens=512, pad_token_id=pipe.tokenizer.eos_token_id)
[{'generated_text': [{'role': 'system',
    'content': "You're a three start Michelin chef answering questions from the public. Answer in 200 words or less."},
   {'role': 'user', 'content': 'How do you make spaghetti bolognaise?'},
   {'role': 'assistant',
    'content': ' As a Michelin-starred chef, I aim to elevate classic dishes while maintaining their essence. My Spaghetti Bolognese recipe balances tradition with refinement.\n\nIngredients:\n1. 500g high-quality ground beef\n2. 1 large yellow onion, finely chopped\n3. 2 carrots, finely chopped\n4. 2 celery stalks, finely chopped\n5. 4 cloves garlic, minced\n6. 1 cup red wine (preferably a full-bodied Italian variety)\n7. 1 can (400g) san Marzano tomatoes\n8. 500g spaghetti\n9. Extra-virgin olive oil\n10. Salt and freshly ground black pepper, to taste\n11. Parmesan cheese, grated (optional)\n\nInstructions:\n1. Heat oil in a large pan over medium heat. Add onion, carrots, celery, and garlic, cooking until softened.\n2. Add ground beef, season with salt and pepper, and cook until browned.\n3. Pour in red wine, allowing it to reduce slightly.\n4. Add tomatoes and their juice, simmering the sauce for at least 1 hour, stirring occasionally.\n5. Cook spaghetti according to package instructions, then drain.\n6. Toss cooked spaghetti in sauce, serving with a sprinkle of Parmesan cheese, if desired.\n\nEnjoy this authentic Italian dish with a glass of Barolo or Chianti for the perfect pairing. Buon appetito!'}]}]
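The output contains the whole conversation, so the reply has to be picked out of it. A minimal sketch, assuming the output structure shown above; the helper name last_assistant_reply and the sample data are hypothetical and hand-written, not real model output.

```python
def last_assistant_reply(pipeline_output):
    """Return the text of the final assistant message in a chat-style output.

    Assumes the structure shown above: a list with one dict whose
    'generated_text' value is the full list of messages.
    """
    conversation = pipeline_output[0]["generated_text"]
    for message in reversed(conversation):
        if message["role"] == "assistant":
            return message["content"].strip()
    raise ValueError("no assistant message found in the output")


# Hand-written sample that imitates the output format; not real model output.
sample_output = [{"generated_text": [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "How do you boil an egg?"},
    {"role": "assistant", "content": " Simmer it for about eight minutes."},
]}]
print(last_assistant_reply(sample_output))
# prints: Simmer it for about eight minutes.
```

Stripping the content removes the leading space that generated replies often start with.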

Reusing the system prompt

Sometimes you’ll want to reuse the same system prompt for multiple messages. An easy way of achieving this is to create a small helper function that injects the system prompt into every message.

def prompt_creator(system_prompt="", messages=None):
    if not messages:
        messages = []
    for message in messages:
        yield [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": message},
        ]

outputs = pipe(
    prompt_creator(
        system_prompt="You're a three start Michelin chef answering questions from the public. Answer in 200 words or less.",
        messages=["How do you make spaghetti bolognaise?", "How do you make a cake?"]
    ), max_new_tokens=512, pad_token_id=pipe.tokenizer.eos_token_id)

for output in outputs:
    print(output)
[{'generated_text': [{'role': 'system', 'content': "You're a three start Michelin chef answering questions from the public. Answer in 200 words or less."}, {'role': 'user', 'content': 'How do you make spaghetti bolognaise?'}, {'role': 'assistant', 'content': " Creating a delectable Spaghetti Bolognese requires a harmonious blend of quality ingredients and patience. Here's a simplified recipe:\n\n1. Gently sauté finely chopped onion, carrot, and celery in olive oil until they soften. Season with salt and pepper.\n\n2. Add ground beef, breaking it up with a wooden spoon. Cook until browned, then drain off any excess fat.\n\n3. Stir in a handful of tomato paste and cook for a minute or two. This will help develop a richer flavor.\n\n4. Pour in a can of whole peeled tomatoes, crushed by hand. Add a glass of red wine if desired, and let it simmer for about 30 minutes.\n\n5. Meanwhile, cook your spaghetti al dente according to package instructions.\n\n6. Stir in grated Parmesan cheese and fresh basil leaves. Taste and adjust seasoning if necessary.\n\n7. Serve your Spaghetti Bolognese hot, with a dusting of Parmesan on top if desired. Buon appetito!"}]}]
[{'generated_text': [{'role': 'system', 'content': "You're a three start Michelin chef answering questions from the public. Answer in 200 words or less."}, {'role': 'user', 'content': 'How do you make a cake?'}, {'role': 'assistant', 'content': " As a three-star Michelin chef, I pride myself on the precision and creativity in my culinary creations. Here's a simplified version of a classic sponge cake recipe, which you can adapt and elevate to your taste.\n\nIngredients:\n1. 120g unsalted butter, softened\n2. 120g caster sugar\n3. 2 large eggs\n4. 120g self-raising flour\n5. 1 tsp baking powder\n6. 2 tbsp milk\n7. 1 tsp vanilla extract (optional)\n\nInstructions:\n1. Preheat the oven to 180°C (160°C fan) / Gas Mark 4 / 350°F. Grease and line an 18cm square cake tin.\n2. In a large mixing bowl, cream together the butter and sugar until light and fluffy.\n3. Beat in the eggs, one at a time, followed by the vanilla extract.\n4. Sift in the flour and baking powder, then fold them gently into the mixture.\n5. Gradually add the milk while continuing to fold, ensuring the mixture remains smooth.\n6. Pour the batter into the prepared cake tin and spread it evenly.\n7. Bake for 20-25 minutes, or until a toothpick inserted into the center of the cake comes out clean.\n8. Allow the cake to cool in the tin for 10 minutes, then transfer it to a wire rack to cool completely.\n\nYou can decorate this cake with your favorite frosting, fruit, or chocolate ganache. Bon appétit!"}]}]
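One thing to keep in mind is that prompt_creator is a generator, so it can only be iterated once. The sketch below repeats the helper in a slightly shortened form so that it runs on its own, and uses placeholder questions to show the effect without involving a model.

```python
def prompt_creator(system_prompt="", messages=None):
    # Same helper as above, repeated so that this snippet runs on its own.
    for message in messages or []:
        yield [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": message},
        ]


prompts = prompt_creator("Answer briefly.", ["Question A", "Question B"])
first_pass = list(prompts)   # consumes the generator
second_pass = list(prompts)  # already exhausted, so nothing is left

print(len(first_pass), len(second_pass))
# prints: 2 0
```

If the same prompts are needed more than once, store them in a list first or call prompt_creator again.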

Batch inference

Pipeline also supports batch inference, but its performance depends heavily on the data ingestion.

If the data is provided in a suitable form, i.e. as a dataset, the pipeline can iterate over it and group the inputs into batched tensors automatically (the batch size is controlled by the batch_size argument, which defaults to 1).

This can be achieved by creating a dataset out of the data and creating the system prompt messages for the whole dataset up front. The pipeline will then convert the dataset into efficient batches.
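Batching means stacking several tokenized inputs into one rectangular tensor, so shorter sequences have to be padded. The toy sketch below shows the idea in plain Python. The function name left_pad_batches is hypothetical, the token ids are made up, and padding on the left is an assumption based on common practice for text generation. Transformers handles all of this internally.

```python
def left_pad_batches(sequences, batch_size, pad_id):
    """Yield batches of equal-length token id lists, padded on the left.

    Toy illustration only: Transformers does its own padding internally.
    """
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [[pad_id] * (width - len(seq)) + seq for seq in batch]


# Made-up token ids standing in for five tokenized inputs.
token_ids = [[5, 6, 7], [8], [9, 10], [11, 12, 13, 14], [15]]
for batch in left_pad_batches(token_ids, batch_size=2, pad_id=0):
    print(batch)
# prints:
# [[5, 6, 7], [0, 0, 8]]
# [[0, 0, 9, 10], [11, 12, 13, 14]]
# [[15]]
```

The padded positions are wasted computation, which is one reason why batching inputs of very different lengths does not always speed things up.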

Let’s consider the stanfordnlp/imdb dataset, which contains movie reviews and the sentiment of each review (positive or negative).

Let’s take a random subset of reviews from the training split.

from datasets import load_dataset

# Number of reviews to sample from the training split
n_reviews = 256

ds = load_dataset("stanfordnlp/imdb")['train'].shuffle().take(n_reviews)

Each review contains a text field and a label field.

next(iter(ds))
{'text': "Skip McCoy is a three time loser pick pocket, unable to curb his instincts back on the street, he picks the purse of Candy on a subway train. What he doesn't realise is that Candy is carrying top secret microfilm, microfilm that is of high interest to many many organisations.<br /><br />Director Samuel Fuller has crafted an exceptional drama set amongst the seedy underworld of New York City. Communist spies and shady government operatives all blend together to make Pickup On South Street a riveting viewing from first minute to the last. Based around a Dwight Taylor story called Blaze Of Glory, Fuller enthused this adaptation with heavy set political agenda, something that many at the time felt was over done, but to only focus on its anti communist leanings is doing it a big disservice.<br /><br />Digging a little deeper and you find characters as intriguing as any that Fuller has directed, the main protagonist for one is the hero of the piece, a crook and a shallow human being, his heroics are not born out of love for his country, they are born out of his sheer stubborn streak. It's quite an achievement that Fuller has crafted one of the best anti heroes of the 50s, and i'm sure he was most grateful to the performance of Richard Widmark as McCoy, all grin and icy cold heart, his interplay with the wonderful Jean Peters as Candy is excellent, and is the films heart. However it is the Oscar nominated Thelma Ritter who takes the acting honours, her Moe is strong and as seedy as the surrounding characters, but there is a tired warmth to her that Ritter conveys majestically.<br /><br />It's a B movie in texture but an A film in execution, Pickup On South Street is a real classy and entertaining film that is the best of its most intriguing director. 9/10",
 'label': 1}

For sentiment analysis, we need to switch out the system prompt and we can do it by creating a function that processes a review and returns corresponding messages.

def data_processor(review):

    system_prompt = """
    You're an accurate classification engine designed to determine if a movie review has a positive or negative sentiment.
    
    Reviews can be positive or negative. You're given a review and you will respond with 0 if the sentiment of the review is negative and 1 if the review is positive.
    
    Give no other output.
    """
    
    return {
        'messages': [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": review['text']}
        ]
    }

Now that we have a data preprocessing function, we can map it over the dataset to create messages we want to pass to the pipeline.

ds = ds.map(data_processor)
next(iter(ds))
{'text': "Skip McCoy is a three time loser pick pocket, unable to curb his instincts back on the street, he picks the purse of Candy on a subway train. What he doesn't realise is that Candy is carrying top secret microfilm, microfilm that is of high interest to many many organisations.<br /><br />Director Samuel Fuller has crafted an exceptional drama set amongst the seedy underworld of New York City. Communist spies and shady government operatives all blend together to make Pickup On South Street a riveting viewing from first minute to the last. Based around a Dwight Taylor story called Blaze Of Glory, Fuller enthused this adaptation with heavy set political agenda, something that many at the time felt was over done, but to only focus on its anti communist leanings is doing it a big disservice.<br /><br />Digging a little deeper and you find characters as intriguing as any that Fuller has directed, the main protagonist for one is the hero of the piece, a crook and a shallow human being, his heroics are not born out of love for his country, they are born out of his sheer stubborn streak. It's quite an achievement that Fuller has crafted one of the best anti heroes of the 50s, and i'm sure he was most grateful to the performance of Richard Widmark as McCoy, all grin and icy cold heart, his interplay with the wonderful Jean Peters as Candy is excellent, and is the films heart. However it is the Oscar nominated Thelma Ritter who takes the acting honours, her Moe is strong and as seedy as the surrounding characters, but there is a tired warmth to her that Ritter conveys majestically.<br /><br />It's a B movie in texture but an A film in execution, Pickup On South Street is a real classy and entertaining film that is the best of its most intriguing director. 9/10",
 'label': 1,
 'messages': [{'content': "\n    You're an accurate classification engine designed to determine if a movie review has a positive or negative sentiment.\n\n    Reviews can be positive or negative. You're given a review and you will respond with 0 if the sentiment of the review is negative and 1 if the review is positive.\n\n    Give no other output.\n    ",
   'role': 'system'},
  {'content': "Skip McCoy is a three time loser pick pocket, unable to curb his instincts back on the street, he picks the purse of Candy on a subway train. What he doesn't realise is that Candy is carrying top secret microfilm, microfilm that is of high interest to many many organisations.<br /><br />Director Samuel Fuller has crafted an exceptional drama set amongst the seedy underworld of New York City. Communist spies and shady government operatives all blend together to make Pickup On South Street a riveting viewing from first minute to the last. Based around a Dwight Taylor story called Blaze Of Glory, Fuller enthused this adaptation with heavy set political agenda, something that many at the time felt was over done, but to only focus on its anti communist leanings is doing it a big disservice.<br /><br />Digging a little deeper and you find characters as intriguing as any that Fuller has directed, the main protagonist for one is the hero of the piece, a crook and a shallow human being, his heroics are not born out of love for his country, they are born out of his sheer stubborn streak. It's quite an achievement that Fuller has crafted one of the best anti heroes of the 50s, and i'm sure he was most grateful to the performance of Richard Widmark as McCoy, all grin and icy cold heart, his interplay with the wonderful Jean Peters as Candy is excellent, and is the films heart. However it is the Oscar nominated Thelma Ritter who takes the acting honours, her Moe is strong and as seedy as the surrounding characters, but there is a tired warmth to her that Ritter conveys majestically.<br /><br />It's a B movie in texture but an A film in execution, Pickup On South Street is a real classy and entertaining film that is the best of its most intriguing director. 9/10",
   'role': 'user'}]}

We only want to pass the messages values to the pipeline, and we can use a utility class called KeyDataset to pick only the values that are under the messages key.

from transformers.pipelines.pt_utils import KeyDataset
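The idea behind KeyDataset can be sketched in a few lines: wrap a dataset and return a single field of each row. The class KeyView below is a hypothetical stand-in written for illustration, and the two sample rows are made up. Use the real KeyDataset with pipelines.

```python
class KeyView:
    """Minimal stand-in showing the idea behind KeyDataset.

    Hypothetical class written for illustration; use the real KeyDataset
    with pipelines.
    """

    def __init__(self, dataset, key):
        self.dataset = dataset
        self.key = key

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # Look up the full row, then return only the requested field.
        return self.dataset[index][self.key]


# Made-up rows that imitate the shape of the review dataset.
rows = [
    {"text": "Great film.", "label": 1},
    {"text": "Dull and slow.", "label": 0},
]
texts = KeyView(rows, "text")
print(len(texts), texts[1])
# prints: 2 Dull and slow.
```

Because the wrapper only looks up rows on demand, nothing is copied and the underlying dataset stays unchanged.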

Now we can run and time the pipeline execution:

%%time

sentiments = [
    sentiment[0]['generated_text'][-1]['content']
    for sentiment in pipe(
            KeyDataset(ds, 'messages'),
            pad_token_id=pipe.tokenizer.eos_token_id
        )
    ]
CPU times: user 27.2 s, sys: 847 ms, total: 28 s
Wall time: 28.1 s

We can then compare the results with the ground truth:

import numpy as np

labels = np.array(KeyDataset(ds, 'label'))
matches = 0
bad_outputs = 0

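for label, sentiment in zip(labels, sentiments):
    try:
        # int() tolerates surrounding whitespace but fails on any other text
        sentiment = int(sentiment)
        matches += (sentiment == label)
    except ValueError:
        # The model answered with something other than a bare integer
        bad_outputs += 1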

print(f"Accuracy: {100 * matches / n_reviews:.2f} %")
print(f"Bad outputs: {bad_outputs}")
Accuracy: 90.23 %
Bad outputs: 2
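Some of the bad outputs may still contain a usable answer, for example a digit followed by a full stop. The sketch below shows one possible, more lenient way of parsing the replies. The helper names and the parsing rule are assumptions made for illustration, and the sample replies are hand-written rather than real model output.

```python
import re


def parse_sentiment(reply):
    """Return 0 or 1 if the reply contains exactly one such digit, else None.

    Illustrative helper; the parsing rule is an assumption, not part of the
    original evaluation.
    """
    digits = re.findall(r"\b[01]\b", reply)
    return int(digits[0]) if len(digits) == 1 else None


def accuracy(labels, replies):
    """Percentage of correct predictions; unparseable replies count as wrong."""
    parsed = [parse_sentiment(reply) for reply in replies]
    correct = sum(p == label for p, label in zip(parsed, labels))
    return 100 * correct / len(labels)


# Hand-written replies that imitate model output; not real results.
replies = [" 1", "0.", "Sentiment: 1", "positive"]
print([parse_sentiment(r) for r in replies])
# prints: [1, 0, 1, None]
print(accuracy([1, 0, 0, 1], replies))
# prints: 50.0
```

Whether lenient parsing is appropriate depends on the use case, since it can hide the fact that the model did not follow the requested output format.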

As a comparison, we can try out a simple sentiment analysis pipeline:

sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", truncation=True, max_length=512)
Device set to use cuda:0
sentiments = np.array([
    0 if sentiment['label'] == "NEGATIVE" else 1
    for sentiment in sentiment_pipeline(
            KeyDataset(ds, 'text'),
        )
])
matches = np.sum(labels == sentiments)

print(f"Accuracy: {100 * matches / n_reviews:.2f} %")
Accuracy: 84.77 %