Vectara’s Hallucination Leaderboard offers a unique benchmark, focusing on the tendency of Large Language Models (LLMs) to “hallucinate,” or fabricate information, during summarization tasks. Updated as of November 1, 2023, it provides an insightful glimpse into how well-known LLMs such as GPT-4, GPT-3.5, and Llama 2 perform under scrutiny.
The Evaluation Method
The leaderboard, based on Vectara’s Hallucination Evaluation Model, measures several critical aspects:
- Accuracy: The percentage of summaries free of hallucinations.
- Hallucination Rate: The percentage of summaries containing fabricated information (100% minus accuracy).
- Answer Rate: The proportion of documents the model actually summarizes rather than refuses.
- Average Summary Length: The mean word count of the model’s summaries.
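As a minimal sketch of how these four metrics relate to each other, the following computes them from per-summary judgments. The field and function names here are hypothetical, not Vectara’s actual schema; only summaries the model actually produced count toward the hallucination statistics.

```python
from dataclasses import dataclass

@dataclass
class SummaryJudgment:
    answered: bool      # did the model produce a summary at all?
    hallucinated: bool  # did the judge flag fabricated content?
    word_count: int     # length of the summary in words

def leaderboard_metrics(judgments):
    """Aggregate per-summary judgments into the four leaderboard metrics."""
    answered = [j for j in judgments if j.answered]
    hallucinated = sum(1 for j in answered if j.hallucinated)
    hallucination_rate = hallucinated / len(answered)
    return {
        "accuracy": 1.0 - hallucination_rate,          # complements each other
        "hallucination_rate": hallucination_rate,
        "answer_rate": len(answered) / len(judgments),
        "avg_summary_length": sum(j.word_count for j in answered) / len(answered),
    }
```

Note that accuracy and hallucination rate always sum to 100%, which matches the paired columns in the table below.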
Current Standings
Here’s a snapshot of the standings as of the last update:
| Model | Accuracy | Hallucination Rate | Answer Rate | Avg. Summary Length (Words) |
|---|---|---|---|---|
| GPT-4 | 97.0 % | 3.0 % | 100.0 % | 81.1 |
| GPT-3.5 | 96.5 % | 3.5 % | 99.6 % | 84.1 |
| Llama 2 70B | 94.9 % | 5.1 % | 99.9 % | 84.9 |
| … | … | … | … | … |
| Google PaLM | 87.9 % | 12.1 % | 92.4 % | 36.2 |
| Google PaLM-Chat | 72.8 % | 27.2 % | 88.8 % | 221.1 |
Behind the Rankings
The methodology involves feeding 1,000 short documents to each LLM and analyzing the summaries it produces. Focusing on summarization faithfulness, rather than overall factual accuracy, enables a like-for-like comparison of how each model handles the information it is given.
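The evaluation loop described above can be sketched as follows. Here `summarize` (the model under test) and `judge_hallucination` (the evaluation model) are placeholder callables, not Vectara’s actual pipeline; the key design point is that the judge compares the summary only against its source document, not against world knowledge.

```python
def evaluate_model(summarize, judge_hallucination, documents):
    """Summarize each document and judge the output against its source.

    summarize: doc -> summary string, or None if the model refuses.
    judge_hallucination: (doc, summary) -> True if content was fabricated.
    (Both are hypothetical stand-ins for the real benchmark components.)
    """
    results = []
    for doc in documents:
        summary = summarize(doc)
        if summary is None:
            # refusals lower the answer rate but are excluded
            # from the hallucination statistics
            results.append({"answered": False})
            continue
        results.append({
            "answered": True,
            # judged only against the provided document
            "hallucinated": judge_hallucination(doc, summary),
            "word_count": len(summary.split()),
        })
    return results
```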
Looking Ahead
Vectara plans to update this leaderboard regularly, adding more models (including GPT-4 Turbo) and expanding its scope to citation accuracy and multilingual capabilities. This ongoing effort reflects the dynamic nature of AI and the need for constant evaluation in the quest for reliable and truthful AI-generated content.
Explore more about this fascinating field and stay updated with the latest AI trends by visiting the Hallucination Leaderboard on GitHub.