Fizzarolli committed · Commit df95224 · verified · 1 Parent(s): 24aedd4

removing taubench for now


The traces from the evals are, to put it lightly, really fucky-wucky, and I don't think OpenBench is scoring them right.

Files changed (1)
  1. README.md +0 -3
README.md CHANGED
@@ -14,9 +14,6 @@ Yes, this is official, and yes, this is, to my knowledge, a real version of Llam
 |-|-|-|-|
 |IFEval (1 epoch, score avged across all strict/loose instruction/prompt accuracies to follow Llama 3 paper)|78.2|81.95|84.775
 |GPQA Diamond (3 epochs)|29.3|37.0|37.5
-|Tau-Bench Airline (1 epoch, GPT-4.1 as user)|28.0*|N/A (tau cannot be ran at 8k context)|36.0*
-
-\* One task had to be ended early due to infinitely looping.
 
 All benchmarks done in OpenBench at 1.0 temp.
 
 
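For reference, the IFEval figure in the table above is a composite: the Llama 3 paper reports strict and loose accuracy at both the prompt and instruction level, and the README averages all four. A minimal sketch of that aggregation, assuming hypothetical result keys (the actual OpenBench output schema isn't shown in this commit):

```python
# Sketch: averaging the four IFEval sub-scores into the single figure reported
# in the table. The metric key names below are hypothetical; OpenBench's real
# result keys may differ.
def ifeval_composite(results: dict[str, float]) -> float:
    keys = [
        "prompt_level_strict_acc",
        "prompt_level_loose_acc",
        "inst_level_strict_acc",
        "inst_level_loose_acc",
    ]
    return sum(results[k] for k in keys) / len(keys)

# Illustrative sub-scores, chosen so the average matches the table's 84.775.
print(round(ifeval_composite({
    "prompt_level_strict_acc": 81.7,
    "prompt_level_loose_acc": 84.3,
    "inst_level_strict_acc": 85.6,
    "inst_level_loose_acc": 87.5,
}), 3))  # -> 84.775
```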