Introduction
In light of the hype around Large Language Models (LLMs), a phenomenon I don't consider entirely fictional, every actor in this realm is showering compliments on its published models. Amidst the accolades, however, a crucial question looms: is there a standard benchmark for evaluating the true power of these language-processing monsters? In this concise exploration, I'll share my general thoughts on the efficiency and features of the popular LLMs as of this writing. Although I delve into paper readings and rigorous number comparisons, I strive to keep the tone abstract enough to engage any reader, while also ensuring an unbiased view and reproducible results. I aim to keep this information regularly updated.
Methodology
Given the absence of a comprehensive benchmark that examines all popular LLMs with an appealing methodology, I embarked on a systematic comparison. Starting from articles that stand out in specific aspects of their benchmarking or methodology, I looked for empirical results that largely agree (over 80%). I consider such results complementary: they can be combined to paint a more complete landscape of the current state.
The initial set of benchmark articles comprised 10 entries, but some were filtered out for reasons such as failing to keep up with the latest published LLMs, inconsistent experiment settings, or a lack of supporting code bases and complementary materials.
After filtering, two of the initial 10 benchmarks remained. Their most recent update was in December 2023 (less than a month before this writing), and both conducted their series of experiments with consistent settings. They not only share complementary material but also provide ablation studies and valuable insights.
The agreement criterion between the two benchmarks was checked over three LLMs they have in common, across 16 well-known tasks, resulting in 81 percent concurrence. This allows me to take results from the more comprehensive of the two studies, verified against another well-established methodology, for the set of LLMs currently considered hot and in demand.
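To make the agreement check concrete, here is a minimal sketch of how such a concurrence rate could be computed. The model names, task names, scores, and the five-point tolerance below are hypothetical placeholders, not figures from either benchmark, and the actual criterion the two studies use may differ (for instance, it could compare model rankings rather than raw scores).

```python
# Hypothetical per-task scores (0-100) reported by the two retained benchmarks
# for the LLMs they have in common; all numbers are placeholders for illustration.
benchmark_a = {
    ("model-x", "mmlu"): 62.1, ("model-x", "gsm8k"): 40.3,
    ("model-y", "mmlu"): 70.4, ("model-y", "gsm8k"): 55.0,
    ("model-z", "mmlu"): 58.9, ("model-z", "gsm8k"): 35.7,
}
benchmark_b = {
    ("model-x", "mmlu"): 60.8, ("model-x", "gsm8k"): 47.2,
    ("model-y", "mmlu"): 71.5, ("model-y", "gsm8k"): 54.1,
    ("model-z", "mmlu"): 59.3, ("model-z", "gsm8k"): 36.4,
}

def agreement_rate(a, b, tolerance=5.0):
    """Fraction of shared (model, task) cells whose scores differ by at most
    `tolerance` points between the two benchmarks (an assumed definition)."""
    shared = a.keys() & b.keys()
    agreeing = sum(1 for key in shared if abs(a[key] - b[key]) <= tolerance)
    return agreeing / len(shared)

print(f"concurrence: {agreement_rate(benchmark_a, benchmark_b):.0%}")
```

A ranking-based variant, checking whether both benchmarks order the shared models the same way on each task, would be an equally reasonable way to define concurrence.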
The primary focus was on benchmarking the most suitable open-source LLMs across different tasks, while keeping popular commercial ones in view. This strategy aims to give a more tangible sense of how close open-source LLMs come to their commercial counterparts on various tasks.