A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs

Lehan He*1,2,3, Zeren Chen*1,2, Zhelun Shi1,2, Tianyu Yu4
Jing Shao†1, Lu Sheng†2
1Shanghai AI Laboratory, 2School of Software, Beihang University
3Shanghai Innovation Institute, 4Tsinghua University

*Indicates Equal Contribution, †Corresponding Authors

Background and Motivation

Existing MLLMs continue to face challenges related to hallucinations. Recent attempts have employed human experts or powerful auxiliary AI systems to provide more accurate preference feedback. However, MLLM responses are usually long, complex, and ambiguous, with inevitable flaws; the hallucinations that remain in the preferred responses interfere with preference optimization.

An intuitive alternative is to improve the quality of preference pairs by directly correcting (or deliberately contaminating) the original responses. Some approaches rely on extensive human annotation or ultra-large proprietary models (such as GPT-4V) to detect hallucinations and then rewrite the responses, so the scalability of the feedback data remains limited. To address this issue, we propose leveraging the reference model itself to enhance the preference pairs in a self-correctional manner, without human or proprietary-model intervention.

(a) Conventional RLAIF baselines generate feedback by using labeler models to distinguish preferences, leading to sub-optimal results. (b) Methods that rely on extensive manual annotation or proprietary models for feedback collection compromise the scalability of feedback data. (c) We propose a topic-level self-correctional paradigm tailored for reducing hallucinations, built on topic clustering and topic overwriting.

Topic-level Preference Overwriting

We propose Topic-level Preference Overwriting (TPO), a topic-level self-correctional paradigm tailored for reducing hallucinations. TPO adopts a deconfounded strategy that replaces every topic involved in a complex response with the best or worst alternative among candidates resampled multiple times, on the same topic, from the reference model itself.

(1) Decompose each response generated by the reference model into sub-responses, and resample additional candidate sub-responses with wh-questions (e.g., what, where, how).
(2) Cluster all sub-responses into several distinct topics based on textual and visual semantics.
(3) Score the sub-responses under each topic and select the highest- and lowest-scoring sub-responses to construct a topic-level preference pair for each topic.
(4) Correct the response by overwriting its sub-responses with the topic-level preferences.
(5) Fine-tune the reference model on the resulting feedback data through DPO.
A minimal sketch of steps (2)-(4) is given below.
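To make steps (2)-(4) concrete, the following Python sketch illustrates the topic clustering and overwriting logic. The embed and score callables are hypothetical stand-ins for the multimodal topic embedding and the sub-response scoring described above, and the greedy cosine-similarity clustering is an illustrative simplification rather than the exact algorithm used in TPO.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple
    import numpy as np

    @dataclass
    class SubResponse:
        text: str
        score: float = 0.0  # trustworthiness score assigned in step (3)

    def cluster_by_topic(subs: List[SubResponse],
                         embed: Callable[[str], np.ndarray],
                         threshold: float = 0.8) -> List[List[SubResponse]]:
        """Step (2): greedily group sub-responses whose embeddings have
        cosine similarity >= threshold into the same topic cluster."""
        clusters: List[List[SubResponse]] = []
        anchors: List[np.ndarray] = []  # one representative embedding per topic
        for sub in subs:
            v = embed(sub.text)
            v = v / (np.linalg.norm(v) + 1e-8)
            sims = [float(a @ v) for a in anchors]
            if sims and max(sims) >= threshold:
                clusters[int(np.argmax(sims))].append(sub)
            else:
                clusters.append([sub])
                anchors.append(v)
        return clusters

    def overwrite_response(original_subs: List[SubResponse],
                           candidate_subs: List[SubResponse],
                           embed: Callable[[str], np.ndarray],
                           score: Callable[[str], float]) -> Tuple[str, str]:
        """Steps (3)-(4): within each topic, keep the highest-scoring sub-response
        for the corrected (chosen) response and the lowest-scoring one for the
        contaminated (rejected) response, then stitch the pieces back together."""
        chosen_parts, rejected_parts = [], []
        for topic in cluster_by_topic(original_subs + candidate_subs, embed):
            for sub in topic:
                sub.score = score(sub.text)        # step (3): score every candidate
            ranked = sorted(topic, key=lambda s: s.score)
            chosen_parts.append(ranked[-1].text)   # best alternative on this topic
            rejected_parts.append(ranked[0].text)  # worst alternative on this topic
        return " ".join(chosen_parts), " ".join(rejected_parts)

The resulting (chosen, rejected) response pairs are the topic-level preference data that step (5) uses for DPO fine-tuning of the reference model.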

📃 Highlights

Without bells and whistles, TPO achieves state-of-the-art trustworthiness across several hallucination benchmarks, reducing the hallucinations of the base model by ~92% on ObjectHal-Bench and by ~38% on MMHal-Bench. We also align the base model using the model itself as the labeler, significantly reducing its own hallucinations (by ~88% on ObjectHal-Bench and ~12% on MMHal-Bench) and breaking through its inherent limitations.



Data Scalability: TPO allows us to collect more feedback data for hallucination reduction at low cost, without human or proprietary-model intervention. As the data scale increases, the trustworthiness of the model continuously improves.

Figure: model trustworthiness as the scale of feedback data increases.


Feedback Quality: We compare the quality of preferred responses generated by TPO against the best responses identified by the labeler model and the original responses. We evaluate their informativeness and trustworthiness via GPT-4V review. Different colors in the pie charts mark the number of winning responses. TPO outperforms both counterparts in informativeness and trustworthiness.

Figure: pie charts of win rates for informativeness and trustworthiness under GPT-4V review.
(1) "Raw" represents a randomly generated response by reference model. (2) "Preferred by Labeler" indicates the best response among all candidates as judged by the labeler. (3) "TPO" represents the response generated by our method.

🖌 Examples

Correct answers and hallucinations are highlighted in different colors.

BibTeX


        @article{he2024topic,
          title={A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs}, 
          author={Lehan He and Zeren Chen and Zhelun Shi and Tianyu Yu and Jing Shao and Lu Sheng},
          journal={arXiv preprint arXiv:2411.17265},
          year={2024}
        }