[Defense] A Human-Centered Evaluation of Rationality and Legal Reasoning in Large Language Models (LLMs): An Audit Guided by the Cattell-Horn-Carroll Theory
Friday, December 6, 2024
11:00 am - 12:30 pm
In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Dana Alsagheer will defend her proposal

A Human-Centered Evaluation of Rationality and Legal Reasoning in Large Language Models (LLMs): An Audit Guided by the Cattell-Horn-Carroll Theory
Abstract
Large language models (LLMs), such as OpenAI's GPT-4, are increasingly deployed in high-stakes domains like law, healthcare, scientific research, and text generation. Despite their impressive capabilities, these models function as "black-box" systems, prompting significant concerns around transparency, reliability, and ethical integrity. Key issues include biases embedded in training data, inconsistencies in reasoning, limited reproducibility, and limited flexibility in adapting to new information. These limitations pose notable risks in domains where fairness, accountability, and transparency are paramount. Traditional evaluations tend to focus on surface-level accuracy, often overlooking the complex cognitive mechanisms underlying LLM responses and the dynamic shifts in user beliefs that develop through continued interaction. This study examines what it actually means for an LLM to pass a domain-specific exam, particularly with respect to cognitive robustness and applicability to real-world scenarios.

To achieve a thorough evaluation, we introduce a two-way auditing framework aimed at comprehensive, accurate, and reliable assessment of LLM capabilities. The first component is the Human Generalization Function (HGF), an interactive assessment in which users interact with the model to evaluate its reasoning and coherence across multiple exchanges. The HGF tracks shifts in user perception, observing how users revise their understanding of the model's abilities over time based on its responses, thus providing insight into the model's perceived trustworthiness and consistency. The second component consists of simulated exams administered by domain experts to assess competence in narrow domains such as legal reasoning. These exams offer a transparent, standardized metric for evaluating LLM performance and a clear benchmark against established human standards in specialized fields. By integrating these methods, our framework moves beyond basic accuracy metrics to investigate LLM performance across both broad and narrow cognitive abilities, assessing whether improvements in specific skills reflect isolated enhancement or broader generalization in line with the Cattell-Horn-Carroll (CHC) theory of intelligence. To support this investigation, we audit five distinct LLMs and offer a comparative analysis highlighting each model's strengths and limitations in various high-stakes applications. This evaluation covers domains requiring precision, ethical reasoning, and logical consistency, advancing our understanding of how these models align, or fail to align, with human cognitive standards in critical decision-making contexts.

Ultimately, our framework seeks to establish a new standard for ethical AI governance by merging rigorous cognitive assessment with dynamic user feedback. This approach enables a transparent, reliable, and human-aligned evaluation of LLMs, illuminating both their capabilities and their limitations. In doing so, we provide essential guidance for the future development of AI that prioritizes transparency, trustworthiness, and alignment with human values, especially in high-stakes areas where accuracy, fairness, and accountability are crucial.
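To illustrate the kind of data the Human Generalization Function (HGF) relies on, the short sketch below records a user's self-reported belief in the model's competence after each exchange and summarizes how that belief shifts over the interaction. It is only an illustrative sketch under the assumption that users report a numeric belief after each response; the names (HGFAudit, Interaction, belief_shift) are hypothetical and not part of the proposal.

```python
# Illustrative sketch of HGF-style belief tracking (hypothetical names,
# not the authors' implementation).
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interaction:
    prompt: str
    model_response: str
    user_belief: float  # user's belief in the model's competence (0.0-1.0), recorded after this response


@dataclass
class HGFAudit:
    interactions: List[Interaction] = field(default_factory=list)

    def record(self, prompt: str, response: str, belief: float) -> None:
        """Log one exchange together with the user's updated belief."""
        self.interactions.append(Interaction(prompt, response, belief))

    def belief_trajectory(self) -> List[float]:
        """User's belief after each interaction, in order."""
        return [i.user_belief for i in self.interactions]

    def belief_shift(self) -> float:
        """Net change in perceived competence from first to last interaction."""
        traj = self.belief_trajectory()
        return traj[-1] - traj[0] if len(traj) >= 2 else 0.0


# Example: trust rises, then drops after an inconsistent answer.
audit = HGFAudit()
audit.record("Summarize this contract clause.", "The clause limits liability to direct damages.", 0.5)
audit.record("Is it enforceable in Texas?", "Yes, subject to review for unconscionability.", 0.7)
audit.record("Earlier you said the opposite.", "Apologies, my prior answer was incorrect.", 0.3)
print(audit.belief_trajectory())  # [0.5, 0.7, 0.3]
print(audit.belief_shift())       # negative value: net drop in perceived competence
```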
Friday, December 6, 2024
11:00 AM - 12:30 PM
PGH 591
Dr. Weidong (Larry) Shi, proposal advisor
Faculty, students, and the general public are invited.

Location: Room 591, Philip Guthrie Hoffman Hall (PGH), 3551 Cullen Blvd, Houston, TX 77204, USA