Pipeline functions
Let’s see what happens when we run sentiment analysis using the pipeline() function.
```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)
```
Stages of the pipeline function
A pipeline function has three stages: Tokenizer, Model, and Post-processing.
Tokenizer Stage
- Text is split into tokens.
- The tokenizer adds special tokens such as [CLS] and [SEP].
- The tokenizer maps each token to its unique ID in the vocabulary of the pre-trained model.
The AutoTokenizer class from Hugging Face is used here.
```python
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
```
- The tokenizer can apply padding and truncation to create tensors of the same length.
```python
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
```
Model stage:
- Downloads the configuration of the model as well as its pre-trained weights.
- The AutoModel class loads a model without its pretraining head, which means it returns a high-dimensional tensor that is a representation of the sentences but is not directly useful for a classification task.
```python
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
```
```python
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```
- Use AutoModelForSequenceClassification for the classification task. This returns the logits.
```python
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
```
Postprocessing stage:
- Apply a softmax to transform the logits into probabilities.
```python
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
```
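To see what the softmax is doing numerically, here is a minimal pure-Python version of the same transformation. The logit values below are illustrative, not actual model output:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then normalize exponentials.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for one sentence: a large positive second logit
# turns into a probability close to 1 for the second class.
probs = softmax([-4.3, 4.7])
```

The outputs always sum to 1, which is what lets us read them as class probabilities.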
- Use the model config's id2label mapping to convert the predicted class IDs into human-readable labels.
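A minimal sketch of that last step, assuming this checkpoint's id2label mapping is {0: "NEGATIVE", 1: "POSITIVE"} (the probability values below are made up for illustration):

```python
# Assumed mapping, mirroring what model.config.id2label would hold
# for a binary sentiment checkpoint.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Fabricated post-softmax probabilities for two sentences.
predictions = [[0.0002, 0.9998], [0.9995, 0.0005]]

# Take the argmax of each row, then look up the label name.
labels = [id2label[max(range(len(p)), key=p.__getitem__)] for p in predictions]
```

In real code the mapping comes from model.config.id2label rather than being written by hand.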
Reference
Hugging Face