Vectara’s Hallucination Leaderboard offers a unique benchmark, focusing on the tendency of Large Language Models (LLMs) to “hallucinate,” or fabricate information, during summarization tasks. Updated as of November 1, 2023, it provides an insightful glimpse into how well-known LLMs such as GPT-4, GPT-3.5, and Llama 2 perform under scrutiny.
The Evaluation Method
The leaderboard, based on Vectara’s Hallucination Evaluation Model, measures several critical aspects:
- Accuracy: The percentage of summaries free of hallucinations.
- Hallucination Rate: The percentage of summaries containing fabricated information (100% minus accuracy).
- Answer Rate: The proportion of documents the model actually summarizes rather than refuses.
- Average Summary Length: The mean word count of the model’s summaries.
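As a minimal sketch of how these four metrics relate to each other, the following computes them from per-summary judgments. The field and function names here are hypothetical, not Vectara’s actual schema; only summaries the model actually produced count toward the hallucination statistics.

```python
from dataclasses import dataclass

@dataclass
class SummaryJudgment:
    answered: bool      # did the model produce a summary at all?
    hallucinated: bool  # did the judge flag fabricated content?
    word_count: int     # length of the summary in words

def leaderboard_metrics(judgments):
    """Aggregate per-summary judgments into the four leaderboard metrics."""
    answered = [j for j in judgments if j.answered]
    hallucinated = sum(1 for j in answered if j.hallucinated)
    hallucination_rate = hallucinated / len(answered)
    return {
        "accuracy": 1.0 - hallucination_rate,          # complements each other
        "hallucination_rate": hallucination_rate,
        "answer_rate": len(answered) / len(judgments),
        "avg_summary_length": sum(j.word_count for j in answered) / len(answered),
    }
```

Note that accuracy and hallucination rate always sum to 100%, which matches the paired columns in the table below.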
Current Standings
Here’s a snapshot of the standings as of the last update:
| Model | Accuracy | Hallucination Rate | Answer Rate | Avg. Summary Length (Words) |
|---|---|---|---|---|
| GPT-4 | 97.0 % | 3.0 % | 100.0 % | 81.1 |
| GPT-3.5 | 96.5 % | 3.5 % | 99.6 % | 84.1 |
| Llama 2 70B | 94.9 % | 5.1 % | 99.9 % | 84.9 |
| … | … | … | … | … |
| Google PaLM | 87.9 % | 12.1 % | 92.4 % | 36.2 |
| Google PaLM-Chat | 72.8 % | 27.2 % | 88.8 % | 221.1 |
Behind the Rankings
The methodology involves feeding 1,000 short documents to each LLM and analyzing the summaries it produces. Focusing on summarization faithfulness, rather than overall factual accuracy, enables a like-for-like comparison of how each model handles the information it is given.
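The evaluation loop described above can be sketched as follows. Here `summarize` (the model under test) and `judge_hallucination` (the evaluation model) are placeholder callables, not Vectara’s actual pipeline; the key design point is that the judge compares the summary only against its source document, not against world knowledge.

```python
def evaluate_model(summarize, judge_hallucination, documents):
    """Summarize each document and judge the output against its source.

    summarize: doc -> summary string, or None if the model refuses.
    judge_hallucination: (doc, summary) -> True if content was fabricated.
    (Both are hypothetical stand-ins for the real benchmark components.)
    """
    results = []
    for doc in documents:
        summary = summarize(doc)
        if summary is None:
            # refusals lower the answer rate but are excluded
            # from the hallucination statistics
            results.append({"answered": False})
            continue
        results.append({
            "answered": True,
            # judged only against the provided document
            "hallucinated": judge_hallucination(doc, summary),
            "word_count": len(summary.split()),
        })
    return results
```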
Looking Ahead
Vectara plans to update this leaderboard regularly, adding more models (including GPT-4 Turbo) and expanding its scope to citation accuracy and multilingual capabilities. This ongoing effort reflects the dynamic nature of AI and the need for constant evaluation in the quest for reliable and truthful AI-generated content.
Explore more about this fascinating field and stay updated with the latest AI trends by visiting the Hallucination Leaderboard on GitHub.