LLM inferences tend to be erratically wrong: 99% of the time the answer is correct, but 1% of the time it is wrong, and wrong in ways that are hard to predict and account for. At TruU we have technologies beyond plain calibration for accounting for errors in deep models. This article looks at the current research on how to solve this problem.
White Box Approaches
- Logits in the last layer. These approaches look at the internal state of the model to compute probabilities. For classifiers, this can be as simple as taking the logits of the final layer and normalizing them; for generators, it can mean taking the logit (or log-probability) of each token and computing a statistical summary (mean, max, median) over all the tokens in the output [Duan et al. 2023], [Kuhn et al. 2023]. Even though these methods are commonly used, they are not known to be very reliable (a minimal sketch appears below).
- Instead of raw logits, one can compute the entropy of the output distribution from the same internal states and use it in the same way.
- Using an ML model on embeddings [Ren et al. 2022]: for LLMs where embeddings of the input and output are available. Given a training set of \<Question, Answer, True/False> tuples:
- Compute the features: embed each question and answer to get \<embedding(Q), embedding(A), T/F>, concatenate the two embeddings, and we are left with \<n-dimensional point, T/F>.
- Run a logistic regression on this data so that Function(Q, A) = probability of correctness.
- When the LLM produces an answer A1 to a question Q1, use the logistic regression to compute a confidence in the result.
The underlying idea is that if an LLM has good knowledge about a certain subset of the embedding space, it will continue to have correct knowledge about that region; a sketch of this recipe also follows.
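Here is a minimal sketch of the logit/entropy idea, assuming white-box access through Hugging Face transformers; the model name (`gpt2`) and the particular summary statistics are only illustrative stand-ins for what [Duan et al. 2023] and [Kuhn et al. 2023] actually study.
```python
# Minimal sketch: white-box confidence from token-level log-probabilities and
# entropies. Assumes local (white-box) access via Hugging Face transformers;
# "gpt2" and the specific summary statistics are illustrative choices only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generation_confidence(prompt: str, max_new_tokens: int = 32) -> dict:
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )
    new_tokens = out.sequences[0][inputs["input_ids"].shape[1]:]
    logprobs, entropies = [], []
    for step_logits, token_id in zip(out.scores, new_tokens):
        logp = torch.log_softmax(step_logits[0], dim=-1)
        logprobs.append(logp[token_id].item())                # log P(chosen token)
        entropies.append(-(logp.exp() * logp).sum().item())   # entropy at this step
    return {
        "mean_logprob": sum(logprobs) / len(logprobs),
        "min_logprob": min(logprobs),
        "mean_entropy": sum(entropies) / len(entropies),
    }
```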
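And a sketch of the embedding-plus-logistic-regression recipe in the spirit of [Ren et al. 2022], assuming you already have a labelled \<Question, Answer, True/False> set; `embed` is a placeholder for whatever embedding function your model stack exposes.
```python
# Sketch: confidence via logistic regression over concatenated (question,
# answer) embeddings, in the spirit of [Ren et al. 2022]. `embed` is a
# placeholder for whatever embedding function the LLM stack exposes.
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_confidence_model(examples, embed):
    # examples: iterable of (question, answer, is_correct) tuples.
    X = np.array([np.concatenate([embed(q), embed(a)]) for q, a, _ in examples])
    y = np.array([int(is_correct) for _, _, is_correct in examples])
    return LogisticRegression(max_iter=1000).fit(X, y)

def confidence(clf, embed, question, answer):
    x = np.concatenate([embed(question), embed(answer)]).reshape(1, -1)
    return clf.predict_proba(x)[0, 1]  # estimated P(answer is correct)
```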
Black Box Approaches
These make up the larger share of the research because they have wider applicability. Many commercially available LLMs are closed source, so producing some level of confidence in their outputs continues to be an important target.
- Make the LLM reflect on its own output. Many of these ideas fall into the category of asking the LLM to critique its work, ranging from simply asking it to state a confidence in its output to more elaborate ways of having it self-reflect on the result.
- [Wagner et al. 2024]
- Ask a question, get the answer
- Generate features
- Assume the answer is correct
- Generate various rationales for the answer
- Assume the answer is incorrect
- Generate various rationales for the answer being wrong
- Ask the LLM to predict the probability that the answer is correct given each rationale, and vice versa.
- Use this to create a confusion matrix and from it derive a confidence. Even though the paper presents this only for a binary classifier, the same approach can be extended to multi-class questions (a loose sketch appears after this list).
- [Shrivastava et al. 2023] This approach takes a white-box route by picking a surrogate model, e.g. Llama, and using it to estimate the confidence of the answer generated by GPT. When Llama would not generate the same answer, they may either not return a result or fall back to an ensemble of open-source models to produce one. To me this approach doesn't sound credible, though maybe I don't fully understand how it works.
- [Li et al. 2024] Generate multiple answers by prompting the LLM for, say, 5 answers, and have it generate multiple justifications for each answer. Then prompt the LLM with the question, all the answers, and all the justifications, and have it assign probabilities to each answer. Rerun this repeatedly while shuffling the order of the justifications, and finally take the average of the resulting probabilities. The authors mention that shuffling the justifications was especially important (sketched after this list).
- [Becker and Soatto 2024] Another variant. Generate explanations for each answer, estimate the entailment probability of each explanation, compute the distribution of answers given each explanation, and then marginalize over the explanations to score an answer. My mathematical bent appreciates this approach, but there is no way to say whether it is any better than the previous one (a small sketch follows the list).
- [Pedapati et al. 2024] I like this approach, though that is strictly an emotional bias with not much scientific basis. Start with a dataset of questions and answers. Perturb each question by various means to generate hundreds of variants, which in turn produce lots of different answers. For each of these question-answer sessions compute the three features the authors study: (a) the semantic sets of the outputs, (b) the lexical similarity of the outputs (ROUGE score), and (c) the SRC minimum value. Since we know whether the generated answer was correct, we can fit a logistic regression from these features to correctness. When answering in a live setting we again compute the three features and predict a confidence in the result. The fundamental insight is that these features capture the knowledge, and the correctness of the knowledge, of the LLM (see the pipeline sketch after this list).
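Here is a loose sketch of the self-reflection idea behind [Wagner et al. 2024]. The `ask_llm` helper, the prompt wording, and the way the counts are collapsed into a single confidence are my own simplifying assumptions, not the paper's exact recipe.
```python
# Loose sketch of the rationale-based self-reflection idea in [Wagner et al. 2024].
# `ask_llm(prompt)` is a hypothetical text-in/text-out helper; prompts and the
# final aggregation are illustrative, not the paper's exact recipe.
def rationale_confidence(ask_llm, question: str, answer: str, k: int = 3) -> float:
    votes = {"assumed_correct": [], "assumed_incorrect": []}
    for assumption, key in [("correct", "assumed_correct"),
                            ("incorrect", "assumed_incorrect")]:
        for _ in range(k):
            rationale = ask_llm(
                f"Question: {question}\nAnswer: {answer}\n"
                f"Assume the answer is {assumption}. Give a short rationale."
            )
            verdict = ask_llm(
                f"Question: {question}\nAnswer: {answer}\nRationale: {rationale}\n"
                "Given this rationale, is the answer correct? Reply YES or NO."
            )
            votes[key].append(verdict.strip().upper().startswith("YES"))
    # The counts play the role of a small confusion matrix: how often the model
    # stands by the answer under each assumed hypothesis.
    support = sum(votes["assumed_correct"]) + sum(votes["assumed_incorrect"])
    return support / (2 * k)
```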
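A sketch of the multi-answer reflection scheme of [Li et al. 2024], using the same hypothetical `ask_llm` helper; the prompt wording, the naive number parsing, and the answer/shuffle counts are illustrative choices.
```python
# Sketch of the multi-answer reflection scheme in [Li et al. 2024].
import random
import re

def multi_answer_confidence(ask_llm, question: str,
                            n_answers: int = 5, n_shuffles: int = 4):
    answers = [ask_llm(f"{question}\nGive one concise answer.")
               for _ in range(n_answers)]
    justifications = [ask_llm(f"{question}\nAnswer: {a}\nJustify this answer briefly.")
                      for a in answers]
    totals = [0.0] * n_answers
    for _ in range(n_shuffles):
        order = list(range(n_answers))
        random.shuffle(order)  # the authors found shuffling especially important
        listing = "\n".join(
            f"Answer {i + 1}: {answers[j]}\nJustification: {justifications[j]}"
            for i, j in enumerate(order)
        )
        reply = ask_llm(
            f"{question}\n{listing}\n"
            "Give a probability between 0 and 1 for each answer being correct, "
            "one number per line."
        )
        # Naive parsing; a constrained output format would be more robust.
        probs = [float(p) for p in re.findall(r"\d*\.?\d+", reply)][:n_answers]
        for i, j in enumerate(order):
            totals[j] += probs[i] if i < len(probs) else 0.0
    return {answers[j]: totals[j] / n_shuffles for j in range(n_answers)}
```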
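For [Becker and Soatto 2024], the core computation is a marginalization over explanations; the two probability helpers below are placeholders standing in for the paper's entailment machinery.
```python
# Sketch of the marginalization step in [Becker and Soatto 2024]: score an
# answer by summing, over candidate explanations e, P(answer | e) * P(e).
# `p_explanation` and `p_answer_given_explanation` are hypothetical helpers.
def stable_explanation_score(answer, explanations,
                             p_explanation, p_answer_given_explanation):
    # P(answer) ~ sum over e of P(answer | e) * P(e)
    return sum(p_answer_given_explanation(answer, e) * p_explanation(e)
               for e in explanations)
```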
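And a pipeline-shaped sketch of [Pedapati et al. 2024]: perturb the question, collect outputs, compute features, and fit a logistic regression against known correctness. `perturb`, `ask_llm`, and the three feature functions are placeholders for the paper's actual components (semantic sets, ROUGE-based lexical similarity, SRC minimum); only the overall shape of the pipeline is intended to be faithful.
```python
# Sketch of the perturb-and-featurize pipeline in [Pedapati et al. 2024].
import numpy as np
from sklearn.linear_model import LogisticRegression

def features_for(question, ask_llm, perturb, feature_fns, n_variants=20):
    variants = [perturb(question) for _ in range(n_variants)]
    outputs = [ask_llm(v) for v in variants]
    return np.array([fn(outputs) for fn in feature_fns])  # e.g. 3 features

def train_confidence_model(labelled_questions, ask_llm, perturb, feature_fns):
    # labelled_questions: list of (question, answer_was_correct) pairs.
    X = np.array([features_for(q, ask_llm, perturb, feature_fns)
                  for q, _ in labelled_questions])
    y = np.array([int(ok) for _, ok in labelled_questions])
    return LogisticRegression().fit(X, y)

def live_confidence(clf, question, ask_llm, perturb, feature_fns):
    x = features_for(question, ask_llm, perturb, feature_fns).reshape(1, -1)
    return clf.predict_proba(x)[0, 1]
```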
Thoughts?
At this point there is no clear winner. What I have found in practice are the following best practices:
- Use multiple models from different families and ensemble their outputs.
- Rather than attempting to give a high-fidelity answer every time, answer only when you are sure and abstain whenever in doubt. For example, if you are using Google, OpenAI, and Anthropic models, answer with a lot of certainty when all three agree and give no answer when there is any disagreement (see the sketch after this list).
- Use an ensemble of the above techniques rather than any single one.
- I believe an approach like [Pedapati et al. 2024], which builds features on the structure of the knowledge the LLM has, combined with some RLHF, may eventually prove to be the best, but this is just a hypothesis at this time.
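A minimal sketch of the agree-or-abstain practice, assuming one callable per provider; exact string matching is a simplification of what "agreement" would mean in practice (you would normally normalize or semantically compare the answers).
```python
# Sketch: query several model families and abstain unless they all agree.
from typing import Callable, Dict, Optional

def answer_on_agreement(question: str,
                        providers: Dict[str, Callable[[str], str]]) -> Optional[str]:
    answers = {name: ask(question).strip().lower()
               for name, ask in providers.items()}
    if len(set(answers.values())) == 1:
        return next(iter(answers.values()))  # all providers agree
    return None  # any disagreement -> abstain
```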
I am still waiting to find some research that blows me away. I am sure 10 more papers have appeared in this space between my doing the research and writing this entry, and perhaps 100 more will exist by the time you read it.
References
- [Becker and Soatto 2024] Becker, Evan, and Stefano Soatto. “Cycles of Thought: Measuring LLM Confidence through Stable Explanations.” arXiv preprint arXiv:2406.03441 (2024).
- [Duan et al. 2023] Duan, Jinhao, Hao Cheng, Shiqi Wang, Chenan Wang, Alex Zavalny, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. “Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models.” arXiv preprint arXiv:2307.01379 (2023).
- [Kuhn et al. 2023] Kuhn, Lorenz, Yarin Gal, and Sebastian Farquhar. “Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation.” arXiv preprint arXiv:2302.09664 (2023).
- [Li et al. 2024] Li, Moxin, et al. “Think Twice Before Assure: Confidence Estimation for Large Language Models through Reflection on Multiple Answers.” arXiv preprint arXiv:2403.09972 (2024).
- [Pedapati et al. 2024] Pedapati, Tejaswini, et al. “Large Language Model Confidence Estimation via Black-Box Access.” arXiv preprint arXiv:2406.04370 (2024).
- [Ren et al. 2022] Ren, Jie, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J. Liu. “Out-of-Distribution Detection and Selective Generation for Conditional Language Models.” arXiv preprint arXiv:2209.15558 (2022).
- [Shrivastava et al. 2023] Shrivastava, Vaishnavi, Percy Liang, and Ananya Kumar. “Llamas Know What GPTs Don’t Show: Surrogate Models for Confidence Estimation.” arXiv preprint arXiv:2311.08877 (2023).
- [Wagner et al. 2024] Wagner, Nico, et al. “Black-box Uncertainty Quantification Method for LLM-as-a-Judge.” arXiv preprint arXiv:2410.11594 (2024).