Week 13 / March 2025

Verification Layers Emerge as the Architecture for Unreliable AI

Formal logic outperforms LLM self-checking, and newer models overgeneralize more than older ones

Synthesized using AI

Analyzed 115 papers. AI models can occasionally hallucinate; please verify critical details.

AI systems are reaching deployment thresholds without reaching reliability thresholds, and the research this week shows verification infrastructure is the missing layer. VeriSafe Agent demonstrates this most clearly: by translating natural language instructions into formal specifications and using rule-based verification to validate each mobile GUI agent action before execution, it achieved 94.33%-98.33% accuracy across 300 instructions in 18 apps—outperforming LLM-based verification by 16.33%-30.00% and increasing task completion rates by 90%-130%. The architectural insight is that probabilistic systems can't reliably validate their own probabilistic outputs. This pattern repeats across domains: LLM summarization of scientific research strips scope-limiting qualifiers in 26%-73% of cases even with explicit accuracy prompts, producing summaries nearly five times more likely to overgeneralize than human-authored ones. MatplotAlt found that state-of-the-art vision-language models hallucinate chart descriptions unless prompted with heuristic-based alt text or parsed data tables. The convergence is clear—generation and validation must be decoupled because foundation models won't become deterministic.

Two papers complicate this by showing where verification itself breaks down. Research on ableist hate moderation reveals that disabled users deeply distrust AI-based content filtering and want granular, user-configured controls by content type rather than platform-enforced policies. The verification problem here isn't technical accuracy—it's that centralized validation can't capture context-dependent harm. Meanwhile, g4r.org provides infrastructure for researchers to study LLM interactions systematically, acknowledging that we lack standardized tools to even measure these reliability gaps at scale. The tension: we're building verification layers for systems we don't yet know how to verify, and users don't trust the verification we do build.

The design challenge this creates is path dependency. Organizations deploying LLM automation now without verification infrastructure are locking in architectures that treat generation as reliable enough. VeriSafe Agent's 90%-130% task completion improvement shows the cost of that assumption. The window to build dual-track systems—where formal logic or rule-based validation acts as a gatekeeper—is narrow. Make architectural decisions assuming unreliability, or pay compounding error costs later.
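The compounding-cost argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes every step of a multi-step task succeeds independently with the same probability; that is a simplification for illustration, not a figure from any paper in this issue, and the function name and example numbers are hypothetical.

```python
def task_success_rate(step_accuracy: float, steps: int) -> float:
    """Chance that an unverified multi-step task finishes with no wrong action.

    Assumes independent steps with identical accuracy (illustrative only).
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


if __name__ == "__main__":
    # Hypothetical per-step accuracies and task lengths.
    for accuracy in (0.90, 0.95, 0.99):
        for steps in (5, 10, 20):
            rate = task_success_rate(accuracy, steps)
            print(f"accuracy={accuracy:.2f} steps={steps:2d} -> {rate:.1%}")
```

Under these assumptions, a 10-step task at 90% per-step accuracy completes cleanly only about 35% of the time, which is why a gatekeeper that catches individual wrong steps matters more as workflows get longer.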

Featured(1/6)

2503.18492

VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification

Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, Insik Shin

Preprint·2025-03-24

Stop relying on LLM self-checking for GUI automation. Integrate formal verification as a gatekeeper layer. Best for high-stakes mobile workflows where a single wrong tap compounds errors.

LLM-based mobile GUI agents automate tasks via natural language, but their probabilistic nature makes them unreliable—they execute actions that don't match user intent, breaking workflows.

Method: VeriSafe Agent translates natural language instructions into formal specifications, then uses rule-based verification to check each agent action before execution. Tested on 300 instructions across 18 mobile apps using GPT-4o, it achieved 94.33%-98.33% accuracy in verifying actions—outperforming existing LLM-based verification by 16.33%-30.00%—and increased task completion rates by 90%-130%.
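The generate-then-verify pattern described above can be sketched in a few lines. This is a minimal illustration, not VeriSafe Agent's specification language or API: the names `Action`, `require_target_in`, `require_value`, `verify_action`, and `run_guarded` are hypothetical, and the rules are hand-written predicates standing in for specifications that the paper derives through autoformalization.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Action:
    kind: str        # e.g. "tap" or "type"
    target: str      # identifier of the UI element acted on
    value: str = ""  # text entered, if any


# A rule returns an error message for a violating action, or None if it passes.
Rule = Callable[[Action], Optional[str]]


def require_target_in(allowed: set) -> Rule:
    """Only elements named in the specification may be touched."""
    def rule(action: Action) -> Optional[str]:
        if action.target not in allowed:
            return f"target {action.target!r} is outside the specification"
        return None
    return rule


def require_value(target: str, expected: str) -> Rule:
    """Text typed into `target` must equal the value the user asked for."""
    def rule(action: Action) -> Optional[str]:
        if action.kind == "type" and action.target == target and action.value != expected:
            return f"{target!r} must receive {expected!r}, got {action.value!r}"
        return None
    return rule


def verify_action(action: Action, rules: Iterable[Rule]) -> List[str]:
    """Deterministically collect every rule violation for one action."""
    return [msg for rule in rules if (msg := rule(action)) is not None]


def run_guarded(proposed: Iterable[Action], rules: Iterable[Rule],
                execute: Callable[[Action], None]) -> List[str]:
    """Execute actions one at a time, stopping before the first violation."""
    rules = list(rules)
    for action in proposed:
        violations = verify_action(action, rules)
        if violations:
            return violations  # the offending action is never executed
        execute(action)
    return []


if __name__ == "__main__":
    # Hypothetical instruction: "send 20 to Alex".
    rules = [
        require_target_in({"recipient_field", "amount_field", "send_button"}),
        require_value("amount_field", "20"),
    ]
    plan = [  # stand-in for actions proposed by an LLM agent
        Action("type", "recipient_field", "Alex"),
        Action("type", "amount_field", "200"),  # wrong amount
        Action("tap", "send_button"),
    ]
    executed: List[Action] = []
    print(run_guarded(plan, rules, executed.append))
    print([a.target for a in executed])
```

The design point is that the check is a plain function of the action and the rules, so the same input always yields the same verdict, unlike asking a second model whether the first one was right.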

Caveats: Tested only on mobile apps. Desktop GUI agents or web automation may require different formalization approaches.

Reflections: Can autoformalization scale to ambiguous instructions where user intent itself is underspecified? · How does verification latency affect user experience in real-time automation scenarios? · What happens when the formal specification itself misinterprets user intent?

ai-interaction · mobile-interfaces · trust-safety

2504.00025

Generalization Bias in Large Language Model Summarization of Scientific Research

Uwe Peters, Benjamin Chin-Yee

Preprint·2025-03-28

Don't use default LLM settings to summarize research for public consumption. Lower the temperature setting and benchmark models for generalization accuracy. If precision matters, human review is non-negotiable.

LLMs summarize scientific research for public audiences, but they may strip caveats and scope limitations, overgeneralizing findings beyond what studies actually support.

Method: Tested 10 LLMs on 4,900 summaries of scientific texts. Even when explicitly prompted for accuracy, DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralized in 26% to 73% of cases. LLM summaries were nearly five times more likely to contain broad generalizations than human-authored summaries (OR = 4.85, 95% CI [3.06, 7.70]). Newer models performed worse than earlier ones.
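A quick way to sanity-check the reported effect size: confidence intervals for odds ratios are conventionally built on the log scale, so the point estimate should sit near the geometric mean of the bounds. The sketch below assumes a standard Wald-type 95% interval (z = 1.96), which the summary above does not confirm; the function names and the 10% baseline are hypothetical. It also shows why an odds ratio is not the same as a ratio of probabilities.

```python
import math


def log_se_from_ci(lower: float, upper: float, z: float = 1.96) -> float:
    """Recover the standard error of log(OR) from a symmetric log-scale CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def ci_from_odds_ratio(odds_ratio: float, log_se: float, z: float = 1.96):
    """Rebuild the interval as exp(log(OR) +/- z * SE)."""
    center = math.log(odds_ratio)
    return math.exp(center - z * log_se), math.exp(center + z * log_se)


def probability_after_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability and return a probability."""
    odds = baseline_prob / (1 - baseline_prob) * odds_ratio
    return odds / (1 + odds)


if __name__ == "__main__":
    reported_or, lower, upper = 4.85, 3.06, 7.70  # values quoted in the summary
    se = log_se_from_ci(lower, upper)
    print(f"geometric mean of bounds: {math.sqrt(lower * upper):.2f}")
    print(f"implied SE of log(OR):    {se:.3f}")
    print("rebuilt CI: [%.2f, %.2f]" % ci_from_odds_ratio(reported_or, se))
    # Hypothetical baseline: 10% of human summaries overgeneralize.
    print(f"10% baseline -> {probability_after_odds_ratio(0.10, reported_or):.0%}")
```

With the hypothetical 10% baseline, an odds ratio of 4.85 corresponds to roughly 35%, about 3.5 times the baseline probability rather than five times, so "five times more likely" is best read as a statement about odds.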

Caveats: Tested on scientific texts. Generalization bias in other domains (news, policy) remains unverified.

Reflections: Why do newer models overgeneralize more than earlier versions—is it training data or architectural changes? · Can fine-tuning on scope-preserving summaries reduce generalization bias? · Do users actually prefer overgeneralized summaries because they're simpler, creating a misaligned incentive?

ai-interaction · evaluation-methods · trust-safety

2503.21844

"Ignorance is Not Bliss": Designing Personalized Moderation to Address Ableist Hate on Social Media

Sharon Heung, Lucy Jiang, Shiri Azenkot, Aditya Vashistha

Preprint·2025-03-27

Don't build intensity-based filters or rely on AI rephrasing for ableist content. Offer type-based filtering with content warnings. Give users granular control—tolerance for hate varies widely even within disability communities.

Platform moderation fails to remove ableist hate, leaving disabled users exposed. Personalized filters could help, but it's unclear what configuration options users actually want.

Method: Interviewed 23 disabled social media users with design probes showing different filter configurations (intensity vs. type of ableism) and presentation options (AI rephrasing, content warnings). Participants preferred filtering by type of ableist speech and favored content warnings over AI rephrasing. They expressed deep distrust in AI-based moderation accuracy and showed varied tolerances for viewing ableist content.
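The preference for type-based filtering with content warnings suggests a per-user configuration that maps each type of ableist speech to an action. The sketch below is one possible data structure, not the study's design probe: the class names and the type labels ("slur", "dismissal") are hypothetical placeholders, and detecting which types a post contains is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class FilterAction(Enum):
    SHOW = 0  # display the post unchanged
    WARN = 1  # keep the original text behind a content warning
    HIDE = 2  # remove the post from the feed


@dataclass
class ModerationPreferences:
    # Per-type choices made by the user; labels are hypothetical placeholders.
    per_type: Dict[str, FilterAction] = field(default_factory=dict)
    # Applied to detected types the user has not configured.
    default: FilterAction = FilterAction.WARN

    def action_for(self, detected_types: Set[str]) -> FilterAction:
        """Return the strictest action among the types detected in a post."""
        if not detected_types:
            return FilterAction.SHOW
        actions = [self.per_type.get(t, self.default) for t in detected_types]
        return max(actions, key=lambda a: a.value)


if __name__ == "__main__":
    prefs = ModerationPreferences(
        per_type={"slur": FilterAction.HIDE, "dismissal": FilterAction.SHOW}
    )
    print(prefs.action_for({"dismissal"}))
    print(prefs.action_for({"dismissal", "slur"}))
    print(prefs.action_for({"unconfigured_type"}))
```

Defaulting unconfigured types to a warning rather than to hiding or rephrasing is a design choice in this sketch; it keeps the original wording available, in line with participants' stated preference for content warnings over AI rephrasing.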

Caveats: 23 participants, qualitative study. Quantitative validation of filter effectiveness and user adoption rates needed.

Reflections: How do you design type-based filters that scale across the evolving language of ableism? · Can users' distrust in AI moderation be mitigated with transparency about accuracy rates? · What happens when users' tolerance for ableist content changes over time—how should filters adapt?

accessibility · social-computing · trust-safety · ethics

Findings(1/5)

Verification layers emerge as the architecture for unreliable automation · Personalized moderation shifts from platform policy to user-configured filtering · Shared control architectures replace omakase automation in assistive robotics · Jargon preservation replaces simplification in cross-disciplinary knowledge tools · Dynamic learning models replace cumulative assumptions in productivity measurement

Agents built on large foundation models (LFMs) inherit probabilistic unreliability that makes direct automation dangerous. VeriSafe Agent introduces logic-based verification as a separate layer that validates actions before execution, reducing mobile GUI agent errors. This architectural pattern (generate, then verify) acknowledges that foundation models won't become deterministic, so systems must be designed around their failure modes. The implication: automation infrastructure now requires dual-track design where generation and validation are decoupled.

2503.18492

VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification

Surprises(1/3)

LLM summarization systematically overgeneralizes scientific findings beyond study scope · Password managers in VR face adoption barriers despite solving known security problems · Youth discuss privacy risks extensively but don't change behavior in practice

We assumed LLMs would democratize science literacy by making research accessible. Testing 10 prominent LLMs on scientific summarization shows they systematically omit scope-limiting details, producing generalizations broader than the original studies warrant. The accessibility gain comes with a systematic accuracy cost: summaries are more readable precisely because they strip the qualifiers that make conclusions valid. The tradeoff isn't a bug to fix; it's inherent to the compression task.
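Even if the tradeoff cannot be engineered away, dropped qualifiers can at least be flagged for human review. The sketch below is a crude lexical check, not the coding scheme used in the paper: the cue list `SCOPE_CUES` and both function names are hypothetical, and a real benchmark would need far more than keyword matching.

```python
import re
from typing import List, Sequence

# Hypothetical scope-limiting cues; extend per domain.
SCOPE_CUES = (
    "in this sample",
    "in this study",
    "among",
    "may",
    "might",
    "suggests",
    "preliminary",
)


def contains_cue(text: str, cue: str) -> bool:
    """Case-insensitive whole-word (or whole-phrase) match."""
    return re.search(rf"\b{re.escape(cue)}\b", text, flags=re.IGNORECASE) is not None


def dropped_qualifiers(source: str, summary: str,
                       cues: Sequence[str] = SCOPE_CUES) -> List[str]:
    """List cues that appear in the source text but not in its summary."""
    return [c for c in cues if contains_cue(source, c) and not contains_cue(summary, c)]


if __name__ == "__main__":
    source = "In this sample of 40 adults, the treatment may reduce symptoms."
    summary = "The treatment reduces symptoms."
    print(dropped_qualifiers(source, summary))
```

A non-empty result does not prove overgeneralization; it only marks a summary as a candidate for the human review recommended above.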

2504.00025

Generalization Bias in Large Language Model Summarization of Scientific Research

TOOLBOX(5)

GPT for Researchers (G4R)

Tool

Free web platform (g4r.org) enabling researchers to create customizable GPT interfaces for studies. Allows participants to interact with ChatGPT-like models, set constraints on topics/tone/response style, and download complete message exchange data between participants and GPT for analysis of human-AI communication patterns.

2503.18303

VeriSafe Agent (VSA)

Framework

Formal verification system for mobile GUI agents using autoformalization to translate natural language instructions into verifiable specifications. Achieves 94.33%-98.33% accuracy in verifying agent actions using GPT-4o, outperforming LLM-based methods by 16.33%-30.00%, and increases task completion rates by 90%-130% across 18 mobile apps.

2503.18492

MatplotAlt

Code

Open-source Python package for automatically generating alternative text for Matplotlib figures in Jupyter notebooks. Integrates with a single line of code or a single command and supports both heuristic and LLM-based generation. Accuracy improves when GPT-4 Turbo is prompted with heuristic alt text or parsed data tables from figures.

2503.20089

DataWeaver

Tool

Integrated authoring system for data-driven narratives supporting bidirectional composition: visualization-to-text via call-out interactions (user-initiated highlights generating narrative content) and text-to-visualization (generating interactive visualizations from existing narratives). Evaluated with 13 participants for creating cohesive data stories anchored to data facts.

2503.22946

StreetScape

Tool

Tactile street puzzle system featuring modular 3D-printed tiles, tactile roadways, and customizable decorative elements for collaborative spatial learning. Designed through iterative process to enable blind and visually impaired children to construct cityscapes through gamified tactile interaction, promoting spatial reasoning and interdependence with sighted peers.

2503.21897

The case for delegated AI autonomy for Human AI teaming in healthcare

Proposes letting diagnostic AI decide *when* it needs human oversight based on case-specific confidence thresholds. Directly confronts the verification bottleneck with selective autonomy.
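Selective autonomy of this kind reduces to a routing decision per case. The sketch below is an assumption-heavy illustration rather than the paper's method: it uses per-label thresholds as a stand-in for "case-specific" ones, and the names `Prediction` and `route`, the labels, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Prediction:
    case_id: str
    label: str
    confidence: float  # model's self-reported confidence in [0, 1]


def route(prediction: Prediction,
          thresholds: Optional[Dict[str, float]] = None,
          default_threshold: float = 0.95) -> str:
    """Return "auto" when confidence clears the label's threshold, else "human"."""
    if not 0.0 <= prediction.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    threshold = (thresholds or {}).get(prediction.label, default_threshold)
    return "auto" if prediction.confidence >= threshold else "human"


if __name__ == "__main__":
    # Hypothetical policy: be stricter before auto-reporting a positive finding.
    policy = {"negative": 0.90, "positive": 0.99}
    cases = [
        Prediction("case-1", "negative", 0.97),
        Prediction("case-2", "positive", 0.97),
        Prediction("case-3", "unclear", 0.80),
    ]
    for case in cases:
        print(case.case_id, route(case, policy))
```

Routing like this is only as good as the confidence score behind it; if the model's confidence is poorly calibrated, the threshold delegates exactly the cases that most need review.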

2503.20160

What is the role of human decisions in a world of artificial intelligence: an economic evaluation of human-AI collaboration in diabetic retinopathy screening

Analyzes 270 real screening decisions to quantify the economic value of human-AI collaboration. Rare empirical data on what humans actually add when AI is 'good enough.'

2504.13868

Diverse AI Personas Can Mitigate the Homogenization Effect in Human-AI Collaborative Ideation

Shows that GenAI makes everyone's ideas sound the same—unless you deliberately inject personality into the AI. Tests whether diverse bot personas restore collective creativity.

2503.22610

Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users

Surveys blind users about MLLMs as visual assistants, finding high adoption but critical gaps. Evaluates what actually matters when AI becomes someone's eyes.

2503.21094

GazeSwipe: Enhancing Mobile Touchscreen Reachability through Seamless Gaze and Finger-Swipe Integration

Combines eye tracking with thumb swipes to reach distant screen areas on large phones. Multimodal interaction that offloads targeting to gaze while keeping gesture familiar.

2503.17955

Human-AI Interaction and User Satisfaction: Empirical Evidence from Online Reviews of AI Products

Mines online reviews to test whether HAI design principles actually correlate with user satisfaction at scale. Rare large-N validation of interaction guidelines in the wild.

2503.19075

The Case for "Thick Evaluations" of Cultural Representation in AI

Argues that current AI culture evaluations are reductive and abstracted from how people define their own representation. Calls for ethnographic depth over scalable metrics.

2504.07971

SPHERE: An Evaluation Card for Human-AI Systems

Introduces a documentation framework for transparently reporting evaluation choices in human-AI systems. Meta-tool for making evaluation design decisions visible and comparable.

REFLECTION(4)

We built verifiers, not verification

AI systems now deploy faster than humans can reliably audit them. The research shows a consistent pattern: hallucinating chatbots, diagnostic tools requiring clinician oversight, and screening systems that still fail under human review—all pointing to the same bottleneck. We've optimized for system capability, not human capacity to catch what breaks.

AI accuracy has crossed the deployment threshold, but human verification has not kept pace—systems are 'good enough' to ship but not good enough to trust unsupervised. If we accept that humans will always be the verification layer, are we designing for the humans we have or the humans we wish we had?

ABOUT THIS ISSUE

How was this newsletter synthesized?

Methodology

This newsletter is generated by an AI pipeline (leveraging Anthropic Sonnet 4.5 & Haiku 4.5) that processes the metadata and abstracts of every new arXiv HCI paper from the past week (115 this issue). Each paper is scored from 1 to 5 on three dimensions: Practice (applicability for practitioners), Research (scientific contribution), and Strategy (industry implications). Papers that pass the threshold are grouped into topic clusters, and each cluster is summarized to capture what that body of research is exploring.

Selection Criteria

The pipeline builds a curated selection that balances high scores with topic diversity—and deliberately includes at least one 'contrarian' paper that challenges prevailing assumptions. This selection is then analyzed to identify key findings (patterns across multiple papers) and surprises (results that contradict conventional wisdom). A narrative synthesis ties the week's research together under a unifying frame.
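The scoring and selection steps can be pictured as a small greedy procedure. The sketch below is a guess at the general shape, not the pipeline's actual code: summing the three scores, the per-cluster cap, the rule for swapping in a contrarian paper, and every name (`Paper`, `select`, `per_cluster_cap`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Paper:
    paper_id: str  # placeholder identifier, not a real arXiv ID
    cluster: str
    practice: int  # each score is 1-5
    research: int
    strategy: int
    contrarian: bool = False

    @property
    def total(self) -> int:
        return self.practice + self.research + self.strategy


def select(papers: Sequence[Paper], k: int, threshold: int,
           per_cluster_cap: int = 2) -> List[Paper]:
    """Pick up to k high scorers, cap each cluster, guarantee one contrarian."""
    if k <= 0:
        return []
    eligible = sorted((p for p in papers if p.total >= threshold),
                      key=lambda p: p.total, reverse=True)
    chosen: List[Paper] = []
    per_cluster: Dict[str, int] = {}
    for paper in eligible:
        if len(chosen) == k:
            break
        if per_cluster.get(paper.cluster, 0) < per_cluster_cap:
            chosen.append(paper)
            per_cluster[paper.cluster] = per_cluster.get(paper.cluster, 0) + 1
    if not any(p.contrarian for p in chosen):
        best_contrarian = next((p for p in eligible if p.contrarian), None)
        if best_contrarian is not None:
            if len(chosen) < k:
                chosen.append(best_contrarian)
            else:
                chosen[-1] = best_contrarian  # swap out the lowest-ranked pick
    return chosen


if __name__ == "__main__":
    pool = [
        Paper("paper-a", "trust", 5, 5, 5),
        Paper("paper-b", "trust", 5, 5, 4),
        Paper("paper-c", "trust", 5, 4, 4),
        Paper("paper-d", "accessibility", 4, 4, 4),
        Paper("paper-e", "robotics", 4, 3, 3, contrarian=True),
    ]
    print([p.paper_id for p in select(pool, k=3, threshold=9)])
```

In this toy pool the third-best "trust" paper is skipped because of the cluster cap, and the last slot goes to the contrarian paper even though a higher-scoring candidate exists, which mirrors the stated tradeoff between score and diversity.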

Key Themes Discovered

Field Report: ai-interaction

Trust, Reliability, and Human Alignment

This cluster examines how humans calibrate trust in AI systems and what design choices enable reliable collaboration. Core tensions emerge: LLMs systematize errors (overgeneralization in science summaries, hallucinations in customer service), yet users struggle to detect failures. Research spans verification mechanisms, evaluation frameworks, and interaction design—from formal verification of GUI agents to thick cultural evaluation. The work is primarily empirical and design-focused, targeting practitioners building AI-assisted workflows and safety-critical systems.

Diverse AI Personas Can Mitigate the Homogenization Effect in Human-AI Collaborative Ideation

2503.17955

Human-AI Interaction and User Satisfaction: Empirical Evidence from Online Reviews of AI Products

2503.18419

Top Papers in this Theme

The Case for "Thick Evaluations" of Cultural Representation in AI

A Survey on (M)LLM-Based GUI Agents

Diverse AI Personas Can Mitigate the Homogenization Effect in Human-AI Collaborative Ideation

Human-AI Interaction and User Satisfaction: Empirical Evidence from Online Reviews of AI Products

Generative AI in Knowledge Work: Design Implications for Data Navigation and Decision-Making