Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading

Ming Ding, Federico Soldà, Weixuan Yuan, Rasmus Kyng

The Education Column, edited by Dennis Komm and Thomas Zeume

Abstract


As large language models (LLMs) advance, their role in higher education,
particularly in free-response problem-solving, requires careful examination.
This study assesses the performance of GPT-4o and o1-preview under realistic
educational conditions in an undergraduate algorithms course. Anonymous
GPT-generated solutions to take-home exams were graded by teaching assistants
unaware of their origin. Our analysis examines both coarse-grained
performance (scores) and fine-grained reasoning quality (error patterns). Results
show that GPT-4o consistently struggles, failing to reach the passing
threshold, while o1-preview performs significantly better, surpassing the
passing score and even exceeding the student median in certain exercises.
However, both models exhibit issues with unjustified claims and misleading
arguments. These findings highlight the need for robust assessment strategies
and AI-aware grading policies in education.

