Assessing GenAI Models' Ability to Evaluate Content Credibility

Bonisiwe Shabane

by jsendak | Feb 24, 2025 | AI | 0 comments

arXiv:2502.14943v1 Announce Type: new

Abstract: Despite recent advances in understanding the capabilities and limits of generative artificial intelligence (GenAI) models, we are just beginning to understand their capacity to assess and reason about the veracity of content. We evaluate multiple GenAI models across tasks that involve the rating of, and perceived reasoning about, the credibility of information. The information in our experiments comes from content that subnational U.S. politicians post to Facebook. We find that GPT-4o, one of the most used AI models in consumer applications, outperforms other models, but all models exhibit only moderate agreement with human coders.

Importantly, even when GenAI models accurately identify low-credibility content, their reasoning relies heavily on linguistic features and “hard” criteria, such as the level of detail, source reliability, and language formality, rather than an understanding of veracity. We also assess the effectiveness of summarized versus full content inputs, finding that summarized content holds promise for improving efficiency without sacrificing accuracy. While GenAI has the potential to support human fact-checkers in scaling misinformation detection, our results caution against relying solely on these models.

Generative artificial intelligence (GenAI) models have made significant advancements in recent years, but the extent to which they can accurately assess and reason about the veracity of content is still being explored. In this study, we evaluate multiple GenAI models’ ability to rate and reason about the credibility of information, using content posted by subnational U.S. politicians on Facebook as our dataset.

One of the most widely used GenAI models, GPT-4o, emerged as the top performer in our evaluation, outperforming other models. However, it is important to note that even the best-performing models exhibited only moderate agreement with human coders. This suggests that although GenAI models have made strides in this area, there is still progress to be made in accurately assessing the veracity of content. Furthermore, our findings reveal that GenAI models heavily rely on linguistic features and “hard” criteria such as level of detail, source reliability, and language formality when determining the credibility of content. While these criteria can be indicative of low-credibility content, they do not necessarily capture the underlying veracity of the information. This highlights the need for GenAI models to develop a deeper understanding of veracity, beyond surface-level indicators.
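Agreement with human coders in studies like this is typically quantified with a chance-corrected statistic such as Cohen’s kappa. The sketch below is a generic illustration, not the paper’s actual analysis code; the labels and toy data are invented:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled the same.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Toy example: human vs. model credibility labels for six posts.
human = ["low", "high", "high", "low", "high", "low"]
model = ["low", "high", "low",  "low", "high", "high"]
print(round(cohens_kappa(human, model), 2))  # → 0.33
```

By common conventions (e.g. Landis and Koch), kappa values between roughly 0.41 and 0.60 are read as “moderate” agreement, which matches the paper’s characterization of the models’ performance.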


Generative Artificial Intelligence (GenAI) has transformed workflows across industries and academia since its surge in popularity in 2022 (Abels, 2023). In scientific research, its adoption has accelerated at an unprecedented pace (Zhao et al., 2023), demonstrating transformative potential in diverse applications. Researchers have successfully employed GenAI models to impersonate survey respondents (Bisbee et al., 2024), generate counterfactual images for experimental research (Davidson, 2024), annotate text with human-comparable accuracy (Gilardi et al., 2023), and identify conspiracy theories (Diab... Among these diverse applications of GenAI, a fundamental question remains about its ability to assess content credibility.

We assess the capacity of GenAI models to rate the veracity of content. The online proliferation of unreliable and misleading information, including misinformation, poses threats to democracies and societies, spurring political violence, undermining trust in democratic institutions globally, and endangering public health (Eliassi-Rad et al., 2020; Lazer... Combating misinformation requires reliable detection tools; however, the complexity of fact-checking remains largely beyond the capabilities of even state-of-the-art AI systems (Neumann et al., 2024). Given these limitations, one well-established alternative is to assess content reliability by using the credibility of source domains as a proxy (Lasser et al., 2023; Guess et al., 2020). This domain-based approach raises concerns about false positives, as not all content from unreliable domains is misinformation. Scholars have explored the potential of Large Language Models (LLMs) as fact-checkers, building on foundational work showing that language models can verify claims without external knowledge bases (Lee et al., 2020).
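The domain-based proxy described above can be sketched as a simple lookup against a curated table of domain ratings, such as those compiled in Lasser et al. (2023). The table, scores, and URLs below are invented for illustration:

```python
from urllib.parse import urlparse

# Hypothetical domain credibility ratings; real studies use curated lists.
DOMAIN_SCORES = {
    "apnews.com": 0.95,
    "dubious-news.example": 0.20,
}

def domain_credibility(url, default=None):
    """Proxy an article's credibility by its source domain's rating."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    return DOMAIN_SCORES.get(host, default)

print(domain_credibility("https://www.apnews.com/article/123"))  # → 0.95
print(domain_credibility("https://dubious-news.example/story"))  # → 0.2
```

The proxy is cheap and scalable, but as the passage notes, it labels every post from a low-rated domain the same way, which is exactly where false positives arise.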

For example, using datasets like news headlines labeled for credibility (Gabriel et al., 2021), researchers found that GenAI models without fine-tuning achieve moderate performance in detecting political misinformation (Ziems et al., 2024). This suggests that if GenAI can reliably assess credibility through zero-shot prompting, it could alleviate human resource demands in large-scale fact-checking. To advance this line of inquiry, we explore the perceived reasoning patterns that emerge during content credibility assessment in zero-shot settings, offering insights into the internal processes of GenAI. Specifically, we address the following research questions:

RQ1 (Viability): Can GenAI models match or exceed the reliability of human coders?
RQ2 (Efficiency): Can summaries generated by zero-shot GenAI models provide efficient credibility ratings?

RQ3 (Functionality): How do GenAI models reason about information credibility?
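Zero-shot credibility rating of the kind studied here amounts to sending each post to a model with only a task instruction and no labeled examples. A minimal sketch follows; the rubric and wording are illustrative, not the paper’s actual prompt:

```python
# Hypothetical rating rubric; real studies tune the scale and instructions.
CREDIBILITY_PROMPT = """\
Rate the credibility of the following social media post on a 1-5 scale,
where 1 means not at all credible and 5 means highly credible.
Reply with the rating followed by a one-sentence justification.

Post: {post}
Rating:"""

def build_zero_shot_prompt(post: str) -> str:
    # Zero-shot: the model sees only the instruction, never labeled examples.
    return CREDIBILITY_PROMPT.format(post=post)

prompt = build_zero_shot_prompt("Our town's new bridge opens Friday.")
# The string would then go to a chat-completion endpoint, e.g. (OpenAI-style):
# client.chat.completions.create(model="gpt-4o",
#                                messages=[{"role": "user", "content": prompt}])
print(prompt.splitlines()[0])
```

Because no fine-tuning or examples are involved, the rubric wording is the main lever: the paper’s finding that models lean on “hard” surface criteria suggests the justification the model returns is as important to inspect as the numeric rating.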

NIST GenAI is an evaluation program administered by the NIST Information Technology Laboratory to assess generative AI technologies developed by the research community from around the world. It is an umbrella program that supports research and measurement science in generative AI by providing a platform for test and evaluation (T&E). The program provides rigorous, science-based T&E of Generators (generative AI), Detectors (discriminative AI), and Prompters (prompt engineering) across multiple modalities: text, image, code, audio, and video. NIST GenAI uses an adversarial testing framework: Generators create high-quality synthetic data (AI content) using frontier AI models, while Detectors develop AI tools to identify whether content is AI-generated and believable.

Prompters provide a different quality of content by employing various prompting strategies.

In today’s fast-paced digital landscape, businesses increasingly rely on generative AI (genAI) to create personalized content, automate responses, and enhance search experiences. However, the effectiveness of these AI-generated outputs heavily depends on their quality. Evaluating and maintaining high standards for genAI content is not just a technical task but a strategic necessity that directly impacts user satisfaction, engagement, and business outcomes. This blog provides essential tips for evaluating genAI content to meet businesses’ needs and drive success in search applications. GenAI content is produced by advanced AI models trained on vast datasets to generate text, images, or other outputs designed to meet specific business needs.

From personalized product descriptions to dynamic pricing strategies, genAI content adapts in real-time based on user behavior and data inputs. However, the value of this content depends on the quality of the underlying models and data. To keep content relevant, it’s crucial to assess whether the genAI outputs meet their intended goals. This involves evaluating clarity, accuracy, and contextual appropriateness to ensure that AI-generated content enhances user experiences. By understanding what qualifies as genAI content, businesses can identify areas for refinement, keeping their content aligned with user expectations and search application needs. Quantitative evaluation involves systematically measuring content quality using specific metrics like readability scores, cognitive load assessments, and engagement metrics.

Readability is crucial for making AI-generated content accessible and easily understood by the target audience. Using readability tools such as the Flesch-Kincaid score helps assess the complexity of content, adjusting language and sentence structure to match the audience’s reading level. For example, an ecommerce site with overly complex product descriptions may frustrate customers, leading to missed opportunities. Simplifying these descriptions using readability assessments can significantly improve user engagement and conversion rates. Similarly, cognitive load assessments help determine how mentally demanding the content is for users. Reducing complexity by avoiding jargon and shortening sentences can make the content more approachable, leading to a smoother user experience.
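Flesch-Kincaid scores of the kind mentioned above are simple functions of average sentence length and syllables per word. A rough sketch follows; the syllable counter is a crude vowel-group heuristic, and production tools use pronunciation dictionaries instead:

```python
import re

def count_syllables(word):
    """Crude heuristic: count vowel groups, with a silent-'e' adjustment."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text):
    """Approximate U.S. school grade level needed to read the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

# Very simple text can legitimately score below grade 0.
print(round(flesch_kincaid_grade("The cat sat on the mat. It was warm."), 1))
```

Running the same function over candidate product descriptions and flagging any that exceed the target audience’s grade level is one concrete way to operationalize the readability checks described above.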
