Microsoft’s new Phi-3.5 LLMs surpass Meta and Google models



Across benchmarks rating models on reasoning and multilingual skills, such as BigBench, MMLU, and ARC Challenge, the MoE-instruct model, despite having fewer active parameters than its rivals (6.6 billion), outperformed Llama 3.1-8B-instruct, Gemma 2-9b-It, and Gemini 1.5-Flash. However, it could not match the performance of OpenAI’s GPT-4o-mini-2024-07-18 (chat).

Still, the company pointed out that the model remains fundamentally limited by its size for certain tasks.

“The model simply does not have the capacity to store too much factual knowledge, therefore, users may experience factual incorrectness,” it said, adding that this weakness can be resolved by augmenting Phi-3.5 with a search engine, particularly when using the model under RAG settings.
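The RAG setup the company alludes to can be sketched as follows: retrieve relevant passages at query time and prepend them to the prompt, so the model answers from supplied facts rather than from its own limited parametric knowledge. The document store, the word-overlap retriever, and the sample facts below are illustrative stand-ins, not Microsoft’s implementation; a production setup would use a search engine or vector index.

```python
# Minimal sketch of a retrieval-augmented generation (RAG) prompt pipeline.
# The docs, scoring, and prompt template are illustrative, not Microsoft's code.

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word-overlap with the query (a toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Prepend retrieved passages so the model grounds its answer in them."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Use only the context below to answer.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

# Hypothetical document store for illustration.
docs = [
    "Phi-3.5-MoE activates 6.6 billion parameters per token.",
    "Mixture-of-experts routing selects a subset of experts per input.",
    "Benchmarks compared include BigBench, MMLU, and ARC Challenge.",
]

prompt = build_prompt("How many parameters does Phi-3.5-MoE activate?", docs)
print(prompt)
```

The augmented prompt would then be sent to Phi-3.5 in place of the bare question, which is how a search engine compensates for the model’s limited capacity to store factual knowledge.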
