Phi-2: The surprising power of small language models.

Girff
2 min read · Dec 14, 2023

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called “Phi” that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding and created a new 1.3 billion parameter model named Phi-1.5, with performance comparable to models 5x larger.

We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with fewer than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including exploration of mechanistic interpretability, safety improvements, and fine-tuning experimentation on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to foster research and development on language models.
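
For readers who want to experiment, the sketch below shows one way to load Phi-2 locally with the Hugging Face transformers library. The "microsoft/phi-2" model ID, half-precision setting, and automatic device placement are assumptions for illustration, not details from the announcement itself.

```python
# Minimal loading sketch (assumptions: the model is published on Hugging Face
# as "microsoft/phi-2"; torch, transformers, and accelerate are installed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision keeps the 2.7B weights small in memory
    device_map="auto",          # let accelerate place layers on GPU/CPU
)
```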

Phi-2 is a Transformer-based language model developed by Microsoft Research. Despite its relatively small size (only 2.7 billion parameters), it has achieved impressive performance on various tasks, surpassing larger models like Mistral (7B parameters) and Llama-2 (13B parameters) in some cases.

Here are some key points about Phi-2:

  • Strong performance: Phi-2 outperforms larger models on several benchmarks, including:
      • Text summarization: Phi-2 achieves better ROUGE scores than Mistral and Llama-2 on CNN/Daily Mail summarization tasks.
      • Question answering: Phi-2 performs well on SQuAD and TriviaQA benchmarks, even outperforming the 25x larger Llama-2-70B model on multi-step reasoning tasks.
      • Coding: Phi-2 demonstrates a strong ability to generate Python code and solve coding problems (see the sketch after this list).
  • Efficiency: Phi-2’s smaller size makes it more efficient to train and deploy, requiring fewer computational resources and less memory.
  • Potential for democratization: The success of Phi-2 suggests that smaller language models can be just as effective as larger ones for many tasks, making them more accessible to a wider range of users and developers.
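
As a concrete illustration of the coding point above, the snippet below reuses the `model` and `tokenizer` objects from the loading sketch earlier in the post and asks Phi-2 to complete a small Python function. The prompt and decoding settings are illustrative assumptions, not the settings used in any benchmark.

```python
# Illustrative only: reuses `model` and `tokenizer` from the loading sketch above.
prompt = 'def is_prime(n: int) -> bool:\n    """Return True if n is a prime number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=80,                    # keep the completion short
    do_sample=False,                      # greedy decoding for a reproducible sketch
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```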

Overall, Phi-2 highlights the potential of smaller language models and paves the way for more efficient and accessible AI solutions.

Here are some additional details about Phi-2:

  • It is a Transformer-based model, similar to many other large language models.
  • It was trained on a massive dataset of text and code.
  • It is still under development, but Microsoft has released a public version for research purposes.
