removing taubench for now
The traces from the evals are, to put it lightly, really fucky-wucky, and I don't think OpenBench is scoring them right.
README.md CHANGED

@@ -14,9 +14,6 @@ Yes, this is official, and yes, this is, to my knowledge, a real version of Llam
 |-|-|-|-|
 |IFEval (1 epoch, score avged across all strict/loose instruction/prompt accuracies to follow Llama 3 paper)|78.2|81.95|84.775
 |GPQA Diamond (3 epochs)|29.3|37.0|37.5
-|Tau-Bench Airline (1 epoch, GPT-4.1 as user)|28.0*|N/A (tau cannot be ran at 8k context)|36.0*
-
-\* One task had to be ended early due to infinitely looping.
 
 All benchmarks done in OpenBench at 1.0 temp.
 
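For reference on the IFEval row kept above: the single score is assumed to be the plain mean of the four accuracies IFEval reports (prompt-level and instruction-level, each strict and loose), following the Llama 3 paper's convention. A minimal sketch, with a hypothetical helper name and made-up per-metric values:

```python
# Hypothetical illustration (not from this repo): the IFEval column is assumed to be
# the plain mean of the four accuracies IFEval reports, per the Llama 3 paper.
def ifeval_composite(prompt_strict: float, prompt_loose: float,
                     instruction_strict: float, instruction_loose: float) -> float:
    """Average prompt-level and instruction-level strict/loose accuracies."""
    return (prompt_strict + prompt_loose + instruction_strict + instruction_loose) / 4

# Made-up per-metric numbers that happen to average to the first column's 78.2:
print(ifeval_composite(80.0, 84.0, 76.0, 72.8))  # 78.2
```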