Scaling Instruction-Finetuned Language Models
What's new?
data:image/s3,"s3://crabby-images/dbd29/dbd29e2b52a4c942e05b3d1d46c6843949f7880d" alt="FLAN1"
Image Source: Scaling Instruction-Finetuned Language Models
This paper explores the benefits of scaling instruction finetuning and how it improves performance across a variety of models (PaLM, T5), prompting setups (zero-shot, few-shot, CoT), and benchmarks (MMLU, TyDiQA). It studies three aspects: scaling the number of tasks (1.8K tasks), scaling model size, and finetuning on chain-of-thought (CoT) data (9 datasets used).
Finetuning procedure:
- 1.8K tasks were phrased as instructions and used to finetune the model
- The finetuning data includes examples both with and without exemplars, and with and without CoT
The finetuning tasks and held-out tasks are shown below:
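The four data formats in the list above (with or without exemplars, with or without CoT) can be sketched as follows. This is a minimal illustration, not the paper's actual templates: the `Example` class, the `Q:`/`A:` layout, and the "So the answer is" phrasing are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Example:
    question: str
    answer: str
    rationale: Optional[str] = None  # chain-of-thought text, if the dataset has one


def format_target(ex: Example, use_cot: bool) -> str:
    """Return the training target: rationale plus answer when CoT is on."""
    if use_cot and ex.rationale:
        return f"{ex.rationale} So the answer is {ex.answer}."
    return ex.answer


def format_input(instruction: str, ex: Example,
                 exemplars: List[Example], use_cot: bool) -> str:
    """Return the model input: instruction, optional exemplars, then the query."""
    blocks = [instruction]
    for shot in exemplars:
        blocks.append(f"Q: {shot.question}\nA: {format_target(shot, use_cot)}")
    blocks.append(f"Q: {ex.question}\nA:")
    return "\n\n".join(blocks)


def build_variants(instruction: str, ex: Example,
                   exemplars: List[Example]) -> Dict[Tuple[str, str], Tuple[str, str]]:
    """Build all four (input, target) pairs for a single example."""
    variants = {}
    for with_shots in (False, True):
        for use_cot in (False, True):
            shots = exemplars if with_shots else []
            key = ("few-shot" if with_shots else "zero-shot",
                   "cot" if use_cot else "no-cot")
            variants[key] = (format_input(instruction, ex, shots, use_cot),
                             format_target(ex, use_cot))
    return variants
```

Calling `build_variants` on one example yields four training pairs, which is one plausible way to expose a model to every combination of prompting setup during finetuning.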
data:image/s3,"s3://crabby-images/d5f96/d5f96d62e9c10e8b9cd4fdb0bedf7c95fb4a4d71" alt="FLAN11"
Capabilities & Key Results
- Instruction finetuning scales well with the number of tasks and the size of the model; this suggests the need to scale both the number of tasks and the model size further
- Adding CoT datasets into the finetuning enables good performance on reasoning tasks
- Flan-PaLM has improved multilingual abilities; 14.9% improvement on one-shot TyDiQA; 8.1% improvement on arithmetic reasoning in under-represented languages
- Flan-PaLM also performs well on open-ended generation questions, which is a good indicator of improved usability
- Instruction finetuning improves performance across responsible AI (RAI) benchmarks
- Flan-T5 instruction-tuned models demonstrate strong few-shot capabilities and outperform public checkpoints such as T5
The results when scaling the number of finetuning tasks and the model size: scaling both the size of the model and the number of finetuning tasks is expected to continue improving performance, although scaling the number of tasks has diminishing returns.
data:image/s3,"s3://crabby-images/2f11f/2f11fdd4b4d610566ef759067387512ca23d5c6a" alt="FLAN2"
Image Source: Scaling Instruction-Finetuned Language Models
The results when finetuning with non-CoT and CoT data: Jointly finetuning on non-CoT and CoT data improves performance on both evaluations, compared to finetuning on just one or the other.
data:image/s3,"s3://crabby-images/ad6fe/ad6fececde2783af8b492e797893f97406537412" alt="FLAN3"
Image Source: Scaling Instruction-Finetuned Language Models
In addition, self-consistency combined with CoT achieves SoTA results on several benchmarks. CoT + self-consistency also significantly improves results on benchmarks involving math problems (e.g., MGSM, GSM8K).
data:image/s3,"s3://crabby-images/de8ba/de8ba53ccf1c2d61081708a500824843c3afb4ed" alt="FLAN4"
Image Source: Scaling Instruction-Finetuned Language Models
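Self-consistency with CoT amounts to sampling several reasoning paths and taking a majority vote over their final answers. The sketch below illustrates that idea under stated assumptions: `sample` stands in for any function that returns one sampled completion, the "answer is X" pattern is a hypothetical output convention, and `make_fake_sampler` is a stub that replaces a real model.

```python
import itertools
import re
from collections import Counter
from typing import Callable, List, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Pull the number that follows 'answer is' out of a completion."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", completion,
                      flags=re.IGNORECASE)
    return match.group(1) if match else None


def self_consistency(prompt: str, sample: Callable[[str], str],
                     n: int = 5) -> Optional[str]:
    """Sample n reasoning paths and return the most frequent final answer."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    answers = [a for a in answers if a is not None]  # drop unparsable paths
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(completions: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: cycles through canned completions."""
    cycle = itertools.cycle(completions)
    return lambda prompt: next(cycle)
```

In practice `sample` would call a model with a nonzero sampling temperature so that the reasoning paths differ; with greedy decoding every path would be identical and the vote would add nothing.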
CoT finetuning unlocks zero-shot reasoning, activated by the phrase "let's think step-by-step", on BIG-Bench tasks. In general, zero-shot CoT Flan-PaLM outperforms zero-shot CoT PaLM without finetuning.
data:image/s3,"s3://crabby-images/79fed/79fed3b3f649c5691c6ef783ae05b390ffef9414" alt="FLAN6"
Image Source: Scaling Instruction-Finetuned Language Models
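Zero-shot CoT prompting only requires appending the trigger phrase to the question, with no exemplars. A minimal sketch, assuming a simple `Q:`/`A:` layout (the layout and the function name are illustrative, not taken from the paper):

```python
def zero_shot_cot_prompt(question: str,
                         trigger: str = "Let's think step-by-step.") -> str:
    """Build a zero-shot prompt whose answer slot starts with the CoT trigger."""
    return f"Q: {question.strip()}\nA: {trigger}"


prompt = zero_shot_cot_prompt("Is the sum of 17 and 25 an even number?")
print(prompt)
```

The model then continues the text after the trigger phrase, so its reasoning appears before the final answer.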
Below are some demonstrations of zero-shot CoT for PaLM and Flan-PaLM on unseen tasks.
data:image/s3,"s3://crabby-images/d5825/d58256474a8af9bd6c57f83201d3b10c60aceae1" alt="FLAN5"
Image Source: Scaling Instruction-Finetuned Language Models
Below are more examples of zero-shot prompting. They show how PaLM struggles with repetition and fails to follow instructions in the zero-shot setting, whereas Flan-PaLM performs well. Few-shot exemplars can mitigate these errors.
data:image/s3,"s3://crabby-images/bf4c7/bf4c7d8ae9c94ec8c7b59733e6ffbb70633a90d0" alt="FLAN7"
Image Source: Scaling Instruction-Finetuned Language Models
Below are some examples demonstrating more zero-shot capabilities of the Flan-PaLM model on several different types of challenging open-ended questions:
data:image/s3,"s3://crabby-images/3d700/3d70086cf894dc0b8cf7b6cf84fa94e7f57f382d" alt="FLAN8"
Image Source: Scaling Instruction-Finetuned Language Models
data:image/s3,"s3://crabby-images/16cae/16cae2671aff33569185771d05f818b11ea8d3a3" alt="FLAN9"
Image Source: Scaling Instruction-Finetuned Language Models
data:image/s3,"s3://crabby-images/70f3c/70f3cc35ce0dc39ef8f791b203cf81a3c642e3dd" alt="FLAN10"
Image Source: Scaling Instruction-Finetuned Language Models
You can try Flan-T5 models on the Hugging Face Hub.
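One way to try a Flan-T5 checkpoint locally is through the `transformers` library. The sketch below assumes `transformers` and PyTorch are installed and uses the `google/flan-t5-small` checkpoint as an example; the `generate` wrapper and its defaults are choices made for this illustration, and larger checkpoints can be swapped in through `model_name`.

```python
def generate(prompt: str, model_name: str = "google/flan-t5-small",
             max_new_tokens: int = 64) -> str:
    """Run one instruction prompt through a Flan-T5 checkpoint."""
    # Imported here so the module loads even if transformers is missing.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Downloads the checkpoint from the Hugging Face Hub on first use.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (requires network access on first run):
# print(generate("Translate English to German: How old are you?"))
```

The function reloads the model on every call, which keeps the sketch short; for repeated use, loading the tokenizer and model once and reusing them would be the more practical design.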