removing taubench for now
The traces from the evals are, to put it lightly, really fucky-wucky, and I don't think OpenBench is scoring them right.
README.md CHANGED

@@ -14,9 +14,6 @@ Yes, this is official, and yes, this is, to my knowledge, a real version of Llam
 |-|-|-|-|
 |IFEval (1 epoch, score avged across all strict/loose instruction/prompt accuracies to follow Llama 3 paper)|78.2|81.95|84.775
 |GPQA Diamond (3 epochs)|29.3|37.0|37.5
-|Tau-Bench Airline (1 epoch, GPT-4.1 as user)|28.0*|N/A (tau cannot be ran at 8k context)|36.0*
-
-\* One task had to be ended early due to infinitely looping.
 
 All benchmarks done in OpenBench at 1.0 temp.
 
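For reference on the IFEval row kept above: the single score is assumed to be the plain mean of the four accuracies IFEval reports (prompt-level and instruction-level, each strict and loose), following the Llama 3 paper's convention. A minimal sketch, with a hypothetical helper name and made-up per-metric values:

```python
# Hypothetical illustration (not from this repo): the IFEval column is assumed to be
# the plain mean of the four accuracies IFEval reports, per the Llama 3 paper.
def ifeval_composite(prompt_strict: float, prompt_loose: float,
                     instruction_strict: float, instruction_loose: float) -> float:
    """Average prompt-level and instruction-level strict/loose accuracies."""
    return (prompt_strict + prompt_loose + instruction_strict + instruction_loose) / 4

# Made-up per-metric numbers that happen to average to the first column's 78.2:
print(ifeval_composite(80.0, 84.0, 76.0, 72.8))  # 78.2
```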