## Loading Dependencies


In [1]:
%pip install --upgrade pip
%pip install torch==1.13.1 \
    torchdata==0.5.1

%pip install \
    transformers \
    datasets \
    evaluate \
    rouge_score \
    loralib \
    peft \
    py7zr


## Imports


In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

## Loading the Dataset

In [3]:
dataset_name = "samsum"

dataset = load_dataset(dataset_name)

dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5), which has already been fine-tuned on a variety of tasks, and its tokenizer from Hugging Face.

In [4]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)


You can count the model's parameters and check how many of them are trainable. A small helper function is enough for this; at this stage, you do not need to go into its details.
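The helper itself is not shown in this section, so here is a minimal sketch of what it could look like. The name `count_parameters` is hypothetical. It assumes only that the model exposes `parameters()` and that each parameter has `numel()` and `requires_grad`, as PyTorch modules do. A tiny stand-in object is used so the sketch runs without downloading a model.

```python
from dataclasses import dataclass


def count_parameters(model):
    """Return (trainable, total) parameter counts for a PyTorch-style model."""
    trainable, total = 0, 0
    for param in model.parameters():
        n = param.numel()
        total += n
        if param.requires_grad:
            trainable += n
    return trainable, total


# Stand-ins that mimic the attributes used above (assumption: illustration only).
@dataclass
class FakeParam:
    size: int
    requires_grad: bool = True

    def numel(self):
        return self.size


class FakeModel:
    def parameters(self):
        # One frozen 768x768 weight and one trainable 768x32 weight.
        return [FakeParam(768 * 768, requires_grad=False), FakeParam(768 * 32)]


trainable, total = count_parameters(FakeModel())
print(f"trainable: {trainable} / total: {total} ({100 * trainable / total:.2f}%)")
```

With the real model loaded above, the call would be `count_parameters(original_model)`.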


## Test the Model with Zero Shot Inferencing


In [None]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

Abdellilah: Where are you?
Sam: work
Abdellilah: What time you finish?
Sam: Not til 5
Abdellilah: Are your bringing him over tonight:
Sam: No in the morning:
Abdellilah: ok, what time?
Sam: About 9. Is that ok?
Abdellilah: ok - see you then

Summary:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Sam won't finish work till 5. Sam is bringing him over about 9 am. Sam will see Abdellilah in the morning. 

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
Sam is at work. He finishes at 5 and is not bringing Abdellilah over tonight. Sam will bring Abdellilah to work at about 9.



## Preprocess the SAMSUM Dataset

Before tokenization, we wrap each dialogue in a prompt: "Summarize the following conversation." goes before the dialogue and "Summary:" goes after it. To save time, we fine-tune on a subset of the SAMSum dataset, keeping every 12th example of each split.


In [None]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset contains 3 different splits. Tokenize function is handling all of these splits
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'dialogue', 'summary',])
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 12 == 0, with_indices=True)


In [None]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (1228, 2)
Validation: (69, 2)
Test: (69, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1228
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 69
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 69
    })
})


The output dataset is ready for fine-tuning.
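The row counts above follow from the filter `index % 12 == 0`, which keeps every 12th example starting at index 0. As a quick sanity check, the number of kept rows for a split of `n` rows should equal `ceil(n / 12)`. The helper name `kept_rows` is only for illustration.

```python
import math


def kept_rows(n_rows, step=12):
    """Number of indices in range(n_rows) with index % step == 0."""
    return len(range(0, n_rows, step))


# Split sizes taken from the DatasetDict printed when the dataset was loaded.
for split, n in {"train": 14732, "test": 819, "validation": 818}.items():
    assert kept_rows(n) == math.ceil(n / 12)
    print(split, kept_rows(n))
# train 1228
# test 69
# validation 69
```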


## Perform Parameter Efficient Fine-Tuning (PEFT)




### Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

# Setting up the configuration
lora_config = LoraConfig(
    r=32, # Rank of the low-rank matrices
    lora_alpha=32, # Scaling factor; the LoRA update is scaled by lora_alpha / r
    target_modules=["q", "v"], # Targeting the query and value projection layers
    lora_dropout=0.05, # Similar to dropout in neural networks
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5 task type
)

peft_model = get_peft_model(original_model,
                            lora_config)
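To get a feel for how small the adapter is, the arithmetic can be sketched by hand. Each targeted linear layer of shape `d_in x d_out` gains two low-rank matrices with `r * (d_in + d_out)` parameters in total, and the LoRA update is scaled by `lora_alpha / r`. The architecture figures below are assumptions about `flan-t5-base` (hidden size 768, 12 encoder and 12 decoder blocks, with an extra cross-attention module in every decoder block), not values computed in this notebook.

```python
# Assumed architecture figures for flan-t5-base (not computed in the notebook).
d_model = 768
encoder_blocks = 12
decoder_blocks = 12

r = 32            # LoRA rank from lora_config
lora_alpha = 32   # LoRA alpha from lora_config
targets_per_attention = 2  # "q" and "v"

# Encoder blocks have self-attention; decoder blocks have self- and cross-attention.
attention_modules = encoder_blocks + 2 * decoder_blocks
adapted_layers = attention_modules * targets_per_attention

# Each adapted layer adds A (r x d_model) and B (d_model x r).
params_per_layer = r * (d_model + d_model)
lora_params = adapted_layers * params_per_layer
scaling = lora_alpha / r

print(f"adapted layers: {adapted_layers}")                # 72
print(f"added trainable parameters: {lora_params:,}")     # 3,538,944
print(f"update scaling (lora_alpha / r): {scaling}")      # 1.0
```

For the actual counts, `peft_model.print_trainable_parameters()` can be called on the model returned by `get_peft_model`.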


### Defining the training argument and the trainer instance

In [None]:
output_dir = f'./peft-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True, # Retries with a smaller batch size if CUDA runs out of memory
    learning_rate=1e-3, # Higher than a typical learning rate for full fine-tuning
    weight_decay=0.01,
    num_train_epochs=10,
    logging_steps=50,
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

In [None]:
# !CUDA_LAUNCH_BLOCKING=1
time1 = time.time()
peft_trainer.train()
time2 = time.time()

training_time = time2 - time1

print(f'Time taken to train the model for 10 epochs using LoRA is: {training_time} seconds')
peft_model_path="./peft-weights"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

| Step | Training Loss |
|------|---------------|
| 50 | 9.6777 |
| 100 | 0.3945 |
| 150 | 0.199 |
| 200 | 0.1571 |
| 250 | 0.142 |
| 300 | 0.1294 |
| 350 | 0.126 |
| 400 | 0.1204 |
| 450 | 0.1211 |
| 500 | 0.1117 |


Time taken to train the model for 10 epochs using LoRA is: 3571.267054080963 seconds


('./peft-weights/tokenizer_config.json',
 './peft-weights/special_tokens_map.json',
 './peft-weights/spiece.model',
 './peft-weights/added_tokens.json',
 './peft-weights/tokenizer.json')

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [11]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       peft_model_path,
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)


## Evaluating the Model Qualitatively

Here you can experiment and see how the model's performance has changed after fine-tuning.

In [12]:
original_model.to('cpu')

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

In [13]:
dash_line = '-'.join('' for i in range(100))

In [31]:
index = 75
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(f'PROMPT: \n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{baseline_human_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'PEFT MODEL:\n{peft_model_text_output}')

PROMPT: 

Summarize the following conversation.

Steve: BTW, USA won last night!
Gulab: I forgot to check!
Steve: England playing tomorrow at 2:00!
Gulab: That's right, Croatia?
Steve: Yep.

Summary: 
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
USA won last night. England is playing against Croatia tomorrow at 2.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Steve and Gulab are going to watch England play tomorrow at 2:00.
---------------------------------------------------------------------------------------------------
PEFT MODEL:
Steve and Gulab are discussing the USA's win last night. England will play tomorrow at 2:00.


## Evaluate the Model Quantitatively (with ROUGE Metric)

To save time, we measure performance on just the first 10 test dialogues. Because this is a summarization task, we use the ROUGE metric. We have already seen how ROUGE works in the [Decipher LLMs](https://learnopencv.com/deciphering-llms/) blog post.
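As a reminder of the idea behind ROUGE-1, here is a simplified sketch that scores unigram overlap between a prediction and a reference. It is not the implementation used by `evaluate`: the function name `rouge1_f1` is hypothetical, and this version only lowercases and splits on whitespace, with no stemming.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat lay on the mat", "the cat sat on the mat")
print(round(score, 3))  # 5 of 6 unigrams match in both directions -> 0.833
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.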

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | Hannah needs Betty's number but Amanda doesn't... | Amanda can't find Betty's number. Hannah asks ... | Hannah can't find Betty's number. Amanda will ... |
| 1 | Eric and Rob are going to watch a stand-up on ... | Eric and Rob are talking about the Russian Rus... | Eric and Rob are watching a video of Eric's st... |
| 2 | Lenny can't decide which trousers to buy. Bob ... | Lenny wants to buy a pair of purple trousers. ... | Lenny wants to buy two pairs of purple trouser... |
| 3 | Emma will be home soon and she will let Will k... | Will is going to be home soon. Emma is not hun... | Emma will be home soon. Will will pick her up. |
| 4 | Jane is in Warsaw. Ollie and Jane has a party.... | Ollie is free for the 18th and the 19th. Jane ... | Jane lost her calendar. Ollie is in Warsaw. Ja... |
| 5 | Hilary has the keys to the apartment. Benjamin... | Hilary has the keys to the conference hall. Hi... | Hilary has keys. Hilary will meet them at lunc... |
| 6 | Payton provides Max with websites selling clot... | Payton likes buying clothes from a lot of webs... | Payton likes shopping but not always buying. M... |
| 7 | Rita and Tina are bored at work and have still... | Rita is tired and a bit sleepy at work. Tina i... | Rita is tired at work. She is looking at the c... |
| 8 | Beatrice wants to buy Leo a scarf, but he does... | Beatrice is in town, shopping. She has a scarf... | Beatrice is in town shopping. He doesn't want ... |
| 9 | Eric doesn't know if his parents let him go to... | Ivan is coming to the wedding. Eric has a lot ... | Ivan is coming to Eric's brother's wedding. Er... |


In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries, # Summaries generated using the base model
    references=human_baseline_summaries[0:len(original_model_summaries)], # Reference summaries by humans
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries, # Summaries generated using the fine-tuned model
    references=human_baseline_summaries[0:len(peft_model_summaries)], # Reference summaries by humans
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)


ORIGINAL MODEL:
{'rouge1': 0.42889770784823256, 'rouge2': 0.1705213408281177, 'rougeL': 0.3118532691066424, 'rougeLsum': 0.313564711150918}
PEFT MODEL:
{'rouge1': 0.468268254301067, 'rouge2': 0.23619570506455861, 'rougeL': 0.39021428273346437, 'rougeLsum': 0.3897312655420616}


Notice that the PEFT model scores higher than the original model on every ROUGE metric. Full fine-tuning would likely have produced results roughly comparable to the PEFT model's, but it would require far more compute. With PEFT, we were able to fine-tune the model on a single GPU in this notebook.
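To make the comparison concrete, the gap between the two sets of scores printed above can be computed directly. This is a small sketch using the reported numbers. Keep in mind that they come from only 10 test dialogues, so the gains are indicative rather than precise.

```python
# ROUGE scores copied from the output above.
original = {"rouge1": 0.42889770784823256, "rouge2": 0.1705213408281177,
            "rougeL": 0.3118532691066424, "rougeLsum": 0.313564711150918}
peft = {"rouge1": 0.468268254301067, "rouge2": 0.23619570506455861,
        "rougeL": 0.39021428273346437, "rougeLsum": 0.3897312655420616}

# Absolute improvement of the PEFT model, expressed in percentage points.
gains = {name: peft[name] - original[name] for name in original}
for name, gain in gains.items():
    print(f"{name}: +{gain * 100:.2f} points")
# rouge1: +3.94 points
# rouge2: +6.57 points
# rougeL: +7.84 points
# rougeLsum: +7.62 points
```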