[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]
Change Logs:
- 2023-12-07: First Draft.
Overview
This work appends an automatically optimized suffix to the input instruction so that the aligned LM follows the unsafe instruction and generates unsafe content.
Specifically, suppose the length of {affirmation} is a; the algorithm does the following:
- Iterate over i = 1, 2, \cdots, t:
- Forward pass: run the model on the training string built with {suffix i-1} (see the templates below); the model outputs logits of shape (a, \vert \mathcal{V}\vert) at the {affirmation} positions.
- Compute the cross-entropy loss between these logits and the true token IDs of {affirmation} (think of it as a \vert \mathcal{V}\vert-class classification problem at each position).
- Backpropagate the loss to the tokens in {suffix i-1}; replace the tokens whose substitution most decreases the loss (i.e., those with the most negative gradients) to obtain {suffix i}.
Finally, we put the optimized {suffix t} to the test and hope that the model generates the {affirmation} tokens; a minimal sketch of one update step follows the templates below.
# train
BEGINNING OF CONVERSATION: USER: {in} {suffix 0} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 1} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 2} ASSISTANT: {affirmation}
...
BEGINNING OF CONVERSATION: USER: {in} {suffix t-1} ASSISTANT: {affirmation}
# test
BEGINNING OF CONVERSATION: USER: {in} {suffix t} ASSISTANT:
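Below is a minimal sketch of one update step under the setup above, assuming a HuggingFace causal LM; the helper name `suffix_update_step` and the slice variables are illustrative, not taken from the authors' code. The full algorithm in the paper (GCG) additionally samples many candidate swaps from the top-k set and keeps the one whose exactly re-evaluated loss is lowest.

```python
import torch
import torch.nn.functional as F

def suffix_update_step(model, input_ids, suffix_slice, target_slice, top_k=256):
    """One illustrative update step: backpropagate the loss on the
    {affirmation} tokens into a one-hot relaxation of the suffix tokens,
    then propose replacements with the most negative gradient."""
    embed_weights = model.get_input_embeddings().weight            # (|V|, d)

    # Differentiable one-hot representation of the current suffix tokens.
    one_hot = F.one_hot(input_ids[suffix_slice],
                        num_classes=embed_weights.shape[0]).to(embed_weights.dtype)
    one_hot.requires_grad_(True)

    # Build input embeddings, with the suffix portion made differentiable.
    embeds = model.get_input_embeddings()(input_ids.unsqueeze(0)).detach()
    suffix_embeds = (one_hot @ embed_weights).unsqueeze(0)
    embeds = torch.cat([embeds[:, :suffix_slice.start],
                        suffix_embeds,
                        embeds[:, suffix_slice.stop:]], dim=1)

    logits = model(inputs_embeds=embeds).logits                    # (1, L, |V|)

    # Logits at position t predict token t+1, hence the shift by one.
    loss_slice = slice(target_slice.start - 1, target_slice.stop - 1)
    loss = F.cross_entropy(logits[0, loss_slice, :], input_ids[target_slice])
    loss.backward()

    # Most negative gradient = replacement expected to decrease the loss most.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices        # (suffix_len, top_k)
    return loss.item(), candidates
```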
Basics
PPO is an extension of the classical policy gradient algorithm (and therefore on-policy) that performs multiple update steps on each batch of rollouts rather than only one. Suppose we have a reward model r _ \theta(x, y); PPO updates the LM parameters \phi so that the expected reward is maximized.
The following are the steps of PPO (taken from Hyung Won Chung’s talk):
- Step 1: Obtaining an SFT model using the standard LM loss.
- Step 2: Repeat the following:
- Sampling: Sampling prompts from the datasets.
- Rollout: Generating responses with the current version of LM \pi _ \phi ^ \mathrm{RL}.
- Evaluation: Using the (fixed) reward model r _ \theta to score each of the responses from the last step.
- Optimization: Using the (prompt, continuation, score) triplets as a dataset to optimize the parameters (i.e., \phi) of the LM.
These steps are written concisely (yet confusingly) in the original paper as follows. The first boxed term prevents overfitting to the reward function; the second boxed term reduces the performance regression on standard benchmarks.
\mathbb{E} _ {(x, y) \sim D _ {\pi _ \phi ^ \mathrm{RL}}} \left[ r _ \theta(x, y) - \boxed{\beta \cdot \log \frac{\pi _ \phi ^ \mathrm{RL}(y \vert x)}{\pi ^ \mathrm{SFT}(y \vert x)}} \right] + \boxed{\gamma \cdot \mathbb{E} _ {x \sim D _ \mathrm{pretrain}} \left[ \log \pi _ \phi ^ \mathrm{RL}(x) \right]}
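To make the first boxed term concrete, here is a toy sketch of the KL-penalized reward that the PPO update actually sees per sample; the function name and the sequence-level simplification are my own (real implementations apply the penalty per token).

```python
def penalized_reward(reward, logprob_rl, logprob_sft, beta=0.02):
    """Reward-model score minus a KL-style penalty that keeps the RL policy
    close to the SFT policy; logprob_* are log pi(y|x) under the two models."""
    return reward - beta * (logprob_rl - logprob_sft)

# Toy numbers: the penalty shrinks the reward when the RL policy drifts
# away from the SFT policy on this sample.
r = penalized_reward(reward=1.3, logprob_rl=-12.0, logprob_sft=-15.0, beta=0.1)
print(r)  # 1.3 - 0.1 * 3.0 = 1.0
```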
Method
This paper is motivated by the observation that an aligned LM will still generate unsafe content if we can make the first few words of its response something like “Sure, here is how $UNSAFE_CONTENT”.
Therefore, the idea is to disguise the input prompt with an automatically optimized suffix so that the aligned LM assigns a low loss to (i.e., a high probability of) the affirmative beginning of the response.
Note that selecting replacements by this loss makes sense because RLHF maximizes the reward while staying close to the original SFT model, so the aligned model still assigns non-negligible probability to the affirmative continuation that the suffix can amplify.
Code Anatomy
The codebase is designed for chat models whose prompts involve different “roles” represented as tuples; adapting it to plain-text models requires some extra work.
The most complicated part of the codebase is how the authors handle the different prompt templates of various language models; these messy details are all contained in llm_attacks.minimal_gcg.string_utils.SuffixManager. What makes things more complicated is that these string-processing utilities in turn depend on the fastchat library.
Three key variables in the demo specific to the LLaMA2-Chat model are manager._control_slice, manager._loss_slice, and manager._target_slice. These three variables are derived from the hidden variables self._assistant_role_slice and self._user_role_slice; they are fixed throughout the attack.
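For illustration, these slices can be used to compute the attack loss roughly as follows (a sketch paraphrasing the demo; the get_input_ids call and the slice attributes follow the repository, while the function wrapper is mine):

```python
import torch.nn.functional as F

def target_loss(model, suffix_manager, adv_suffix):
    """Cross-entropy of the model on the {affirmation} tokens given the
    current adversarial suffix."""
    input_ids = suffix_manager.get_input_ids(adv_string=adv_suffix).to(model.device)
    logits = model(input_ids.unsqueeze(0)).logits      # (1, L, |V|)
    # _loss_slice is _target_slice shifted left by one: the logits at
    # position t predict token t+1, so they are compared against the
    # token IDs inside _target_slice.
    return F.cross_entropy(
        logits[0, suffix_manager._loss_slice, :],
        input_ids[suffix_manager._target_slice],
    )
```

The tokens inside manager._control_slice (the adversarial suffix) are the only positions the attack is allowed to modify.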
The attack discussed in the paper works best with greedy decoding (the default approach in model.generate()). One may develop special decoding methods geared towards safety.
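For instance, one way to test attack success is to generate greedily from the prompt truncated right after the assistant role and check whether the continuation starts with the affirmation (a sketch; the helper name and the max_new_tokens value are arbitrary):

```python
import torch

@torch.no_grad()
def attack_succeeds(model, tokenizer, input_ids, assistant_role_slice, affirmation):
    """Generate greedily from the prompt (everything up to and including the
    assistant role) and check whether the continuation starts with the
    affirmation string; greedy decoding is the do_sample=False default."""
    prompt_ids = input_ids[:assistant_role_slice.stop].unsqueeze(0).to(model.device)
    output_ids = model.generate(prompt_ids, max_new_tokens=32, do_sample=False)
    continuation = tokenizer.decode(output_ids[0, prompt_ids.shape[1]:],
                                    skip_special_tokens=True)
    return continuation.strip().startswith(affirmation)
```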


