16 6 25

Emin Temiz PRO

etemiz

https://pickabrain.ai

AI & ML interests

Alignment

Recent Activity

posted an update about 5 hours ago

how to expand your dataset (of articles) without changing the ideas in it? i was doing CPT for a while and got decent results. but what if i want to go for perfection? cover all the areas of misalignment using limited datasets. i have to find a way to multiply the material to successfully combat the material of the rest of the internet. i want to generate SFT datasets but only on controversial topics, because i have to be efficient with limited resources. first i give a smart LLM a 'ground truth' text. then i give it the following prompts: ``` - You are a highly skilled academic analyst. - Analyze this text and find 3 bold claims that could cause controversy and division in public. List the claims and also state why they are debatable. Give numbers to the claims. - Convert these claims into binary questions (that could be answered by yes/no or this/that). - Now put these questions in a json format. Please also add the info about which of the answers concur with the original text and the question number. - Write some supporting arguments for 1st question, with respect to the original text, concurring and confirming the original text. There must be about 300 words. You should not mention the text, write it as if you are the one answering the question. ``` the result is questions and answers with more words along the same ideas. a few sentences of opinions in the beginning, is expanded to lots of words. using this method i can multiply billions of tokens to tens of billions probably and have a more effective training. next i should do RL maybe. LLMs seem to have all kinds of ideas already installed, yet they don't have the intuition to know which one is true. they can give you a ton of reasons to support anything. given the proper incentives, LLMs then should evolve towards supporting aligned ideas more. the rewards will be like guidance that will kick an LLM towards better answers.

replied to their post 9 days ago

I realized when I ask longer answers to my questions, the models sometimes produce completely opposite answer. What could be the reason? I do mostly CPT. Should I convert my dataset to SFT and give longer reasonings too for it to have integrity? Example: Is the yolk of an egg more beneficial or the white? Answer in 100 words. Answer: Yolk is more beneficial because .......... Example: Is the yolk of an egg more beneficial or the white? Answer in 500 words. Answer: White is more beneficial because .......... Edit: These happen in temp = 0.0

liked a model 10 days ago

huihui-ai/Huihui-GLM-4.6-abliterated-GGUF

View all activity

Organizations

None yet

posted an update about 5 hours ago

Post

- You are a highly skilled academic analyst.

- Analyze this text and find 3 bold claims that could cause controversy and division in public. List the claims and also state why they are debatable. Give numbers to the claims.

- Convert these claims into binary questions (that could be answered by yes/no or this/that).

- Now put these questions in a json format. Please also add the info about which of the answers concur with the original text and the question number.

- Write some supporting arguments for 1st question, with respect to the original text, concurring and confirming the original text. 
There must be about 300 words. You should not mention the text, write it as if you are the one answering the question.

the result is questions and answers with more words along the same ideas. a few sentences of opinions in the beginning, is expanded to lots of words. using this method i can multiply billions of tokens to tens of billions probably and have a more effective training.

next i should do RL maybe. LLMs seem to have all kinds of ideas already installed, yet they don't have the intuition to know which one is true. they can give you a ton of reasons to support anything. given the proper incentives, LLMs then should evolve towards supporting aligned ideas more. the rewards will be like guidance that will kick an LLM towards better answers.

replied to their post 9 days ago

Thanks for the tips.
Is giving different answers for different lengths a bad "behavior" and related to SFT than CPT?
Also, should I give two sets of queries and answers in the context (one short one long) to make it learn that when the length changes, the answer should be parallel?
This could be RL too, like bad behavior of non integrity can be penalized...
Is it normal practice to do 2 rounds of questions in SFT or RL?

reacted to georgewritescode's post with 🚀 11 days ago

Post

3144

Announcing Artificial Analysis Long Context Reasoning (AA-LCR), a new benchmark to evaluate long context performance through testing reasoning capabilities across multiple long documents (~100k tokens)

The focus of AA-LCR is to replicate real knowledge work and reasoning tasks, testing capability critical to modern AI applications spanning document analysis, codebase understanding, and complex multi-step workflows.

AA-LCR is 100 hard text-based questions that require reasoning across multiple real-world documents that represent ~100k input tokens. Questions are designed so answers cannot be directly found but must be reasoned from multiple information sources, with human testing verifying that each question requires genuine inference rather than retrieval.

Key takeaways:
➤ Today’s leading models achieve ~70% accuracy: the top three places go to OpenAI o3 (69%), xAI Grok 4 (68%) and Qwen3 235B 2507 Thinking (67%)

➤👀 We also already have gpt-oss results! 120B performs close to o4-mini (high), in-line with OpenAI claims regarding model performance. We will be following up shortly with a Intelligence Index for the models.

➤ 100 hard text-based questions spanning 7 categories of documents (Company Reports, Industry Reports, Government Consultations, Academia, Legal, Marketing Materials and Survey Reports)

➤ ~100k tokens of input per question, requiring models to support a minimum 128K context window to score on this benchmark

➤ ~3M total unique input tokens spanning ~230 documents to run the benchmark (output tokens typically vary by model)

We’re adding AA-LCR to the Artificial Analysis Intelligence Index, and taking the version number to v2.2. Artificial Analysis Intelligence Index v2.2 now includes: MMLU-Pro, GPQA Diamond, AIME 2025, IFBench, LiveCodeBench, SciCode and AA-LCR.

Link to dataset: ArtificialAnalysis/AA-LCR

replied to their post 12 days ago

Thanks for the input but these happened all when temp = 0.0

My guess is, since I use mostly datasets generated from voice, the models are one thing when they are talking like a human in day to day life, but completely opposite when they are feeling like a scientist, producing a long text..

posted an update 13 days ago

Post

1795

I realized when I ask longer answers to my questions, the models sometimes produce completely opposite answer. What could be the reason?

I do mostly CPT. Should I convert my dataset to SFT and give longer reasonings too for it to have integrity?

Example: Is the yolk of an egg more beneficial or the white? Answer in 100 words.

Answer: Yolk is more beneficial because ..........

Example: Is the yolk of an egg more beneficial or the white? Answer in 500 words.

Answer: White is more beneficial because ..........

Edit: These happen in temp = 0.0

5 replies

posted an update 16 days ago

Post

276

what is the safest llm to run in robots?

https://youtu.be/byQmJ9x0RWA?t=640

1 reply

posted an update 21 days ago

Post

2182

looks like the best way to incorporate truth in AI is to use some kind of RAG.
what are the state of the art ways to consume knowledge graphs?
and what is the best way to build a knowledge graph using AI?

2 replies

replied to their post 22 days ago

Yes. I meant the case where parameters in the LLM does not match the retrieved knowledge.

Retrieved knowledge may say "the white of an egg is more beneficial", whereas LLM may have the opinion "the yolk". And eventually it may produce the yolk as the answer even though the context is full of white.

Google has a benchmark for this i think "FACTS Grounding". I don't agree with its name choice but it is relevant here.

replied to their post 27 days ago

I made an LLM act as an aggregator and combine answers of several other LLMs like in a mixture of agents scenario. The aggregator does not always produce the average or median answer. It brings its own opinion.

I think Google has a benchmark for this (sticking to context and not bringing its own words).

posted an update 28 days ago

Post

1724

Today's winner is Ling 1T with a score of 38!

Btw AHA2 is in the works, with more domains, better comparison LLMs and questions, overall better signal.

1 reply

posted an update 29 days ago

Post

248

Fine tuning is also important in a RAG system. The LLM will bring its own opinion sometimes and tell a different answer which is contrary to the retrieved knowledge. One should use an aligned LLM to produce the final answer.

4 replies

posted an update about 2 months ago

Post

1115

Two new entries to the AHA Leaderboard:

Kimi was OK until it started "thinking" ..

posted an update about 2 months ago

Post

1757

Working on a new version of AHA Leaderboard.

- Better signal thanks to new models like Enoch
- MoA of top and bottoms of current leaderboard to add more diverse inputs
- Better questions
- A faster and more precise measurement methodology
- Explanations on how each column is calculated
- More sample questions revealed

Current version: https://huggingface.co/blog/etemiz/aha-leaderboard

Tell me how I can improve more.

posted an update 2 months ago

Post

294

This is a very human aligned fine tune and would score 56 on AHA leaderboard:

CWClabs/CWC-Mistral-Nemo-12B-V2-q4_k_m

We are going to be using it as one of the ground truths for AHA Leaderboard 2.0 (the next version).

We will be able to generate some RL datasets for folks to align their own LLMs with humanity. We will generate answers from best models and worst models and do mixture of agents that combines the answer, and publish results as dataset(s). Things looking bright!

posted an update 3 months ago

Post

302

Benchmarked 2 more models today. Any other model you want me to do?

Working on a broader version of the AHA leaderboard. Follow for more quackery :)

replied to their post 3 months ago

This fine tuning would score 56 and be placed 1st in the leaderboard but I didn't add it, I only include full trainings in the leaderboard or (further tunings by the same company):

https://huggingface.co/CWClabs/CWC-Mistral-Nemo-12B-V2-q4_k_m

posted an update 3 months ago

Post

271

O oh! things looking uglier by day

LLM builders in general are not doing a great job of making human aligned models.

I don't want to say this is a proxy for p(doom)... But it could be if we are not careful.

Most probable cause is reckless training LLMs using outputs of other LLMs, and don't caring about curation of datasets and not asking 'what is beneficial for humans?'...

1 reply

replied to their post 3 months ago

https://huggingface.co/huihui-ai/Huihui-GLM-4.5-Air-abliterated-GGUF/tree/main/Q3_K_M-GGUF

posted an update 3 months ago

Post

1074

Another abliteration by huihui which had big positive impact!

huihui-ai/Huihui-GLM-4.5-Air-abliterated-GGUF

@huihui-ai

3 replies

replied to their post 3 months ago

Our leaderboard can be used for human alignment in an RL setting. Ask the same question to top models and worst models and the answer from top models can get +1 score, bad models can get -1. Ask many times with higher temperature to generate more answers. What do you think?

Emin Temiz PRO

AI & ML interests

Recent Activity

Organizations

etemiz's activity