An undergraduate from a well-established Bay Area university was working in Giorgio Lagna's cardiovascular research lab at UCSF. She could design a control experiment most postdocs would have struggled with. Lagna, a research scientist and biologist by training, asked her what classes had helped her learn to do that.
"Not in class," she told him. She'd attended lectures and taken exams. The things she actually knew, she'd learned in internships, where people had asked her hard questions and made her defend her answers.
That conversation crystallized something Lagna had already started to suspect. Students with the strongest transcripts sometimes struggled at the bench, while students with unremarkable transcripts outperformed their peers in certain exercises. Whatever the transcript was measuring, it was not what mattered most in his classroom.
An assessment crisis that started before ChatGPT
Lagna started teaching in 2018. By spring of 2020, he was getting batches of student essays that were essentially the same essay. Lightly edited, but functionally identical. This was before any current AI tools existed. Students were copying from each other and from the same online sources.
"AI didn't break assessment," he said. "Assessment was already broken."
The standard response to AI in higher ed has been to retreat to lower-tech assessment methods. Recent reporting notes that Penn, Cornell, NYU, and others have been moving back to oral exams and handwritten in-class work because they're worried about AI cheating. Lagna agrees oral exams are a better measure of what a student actually knows. He just doesn't think going back to a number two pencil is the move. "You are not testing their knowledge," he says of the locked-room, paper-and-pencil exam. "You are preparing them for the last century's jobs."
The Crucible, the Bio 2 course Lagna teaches at Santa Clara University, is built on a different bet. If students remember the questions they had to defend and forget the lectures they sat through, the design job is to put those questions back in the classroom. AI is the only tool Lagna has seen that can do that for every student, every week.
Inside The Crucible, 13 AI tutors now run alongside the curriculum. None of them are built to give students answers.
Three kinds of Gems, one of which lies on purpose
These AI tutors, called Gems, come in three types.
Tollbooth Gems run before class. A student opens one the night before, gets a single content question, and works through a 5- to 10-minute conversation. The Gem is engineered to refuse to give answers. It asks one question at a time, checks the student's reasoning at every step, and remembers what each student worked through the week before. Somewhere in the exchange, it plants a deliberate factual error. A few turns later, it asks whether the student caught it.
Boss Battle Gems are used during exams and simulated crisis scenarios. A student team might walk into an emergency room with a patient in a sickle cell crisis. The Gem acts as their on-call AI consultant, and it lies on purpose. The students don't know where the errors are. Their job is to find them, tag them, and explain what's wrong before the simulated patient deteriorates. The more errors they catch, the higher they score.
"Hallucinations are not a bug in my classroom," Lagna says. "They are the assignment."
Certification Gems run the oral exam at the end of each unit. Five to ten minutes, voice conversation, in real time. The Gem opens with one question about the unit, listens, asks the student to go deeper or defend something that sounds wrong, and produces a transcript and an encrypted score that posts directly to the gradebook in Canvas.
The whole design rests on a learning science principle Lagna credits to Robert Bjork at UCLA: durable learning requires desirable difficulty. To maximize learning and growth, the student has to struggle a little. Quick answers mean quick forgetting. The Gem keeps the conversation in the productive zone where effort gets rewarded and guessing doesn't.
What the data is starting to show
Earlier this year, Lagna and two Santa Clara colleagues, Brian Bayless and Denise Krane, published a study in the Journal of College Science Teaching presenting what they call the DISCO framework. The headline finding: the students who came in as strong writers stayed strong. The students who came in struggling improved substantially.
"AI didn't level the top of the curve," Lagna says. "It lifted the bottom."
This equity story matters most at Foothill, where Lagna's other classroom sits. The MESA program there serves first-generation and historically underrepresented students in STEM. Authentic oral assessment, the kind that's been a privilege of Oxford tutorials for centuries, was never going to reach a 50-person community college class. The students in those classrooms, many of them working night shifts or raising kids or learning in a second language, were not going to get a 5-minute Socratic conversation with a faculty member before every class meeting. Now they can. NYU recently ran a comparable AI-powered oral exam pilot for $15 across 36 students, or about 42 cents per student.
Lagna's also running a CRISPR research project with those MESA students. They're designing a treatment for alkaptonuria, a rare disease caused by a single gene that's mutated 323 different ways across the patient population. The system uses six different CRISPR and transgenic approaches that, together, could target every one of those mutations. With AI handling modeling and verification at every step, students who don't yet have advanced biology or programming skills can run the verification path that catches when the AI gets things wrong.
"The unit of research is no longer the single principal investigator with the grant," Lagna says. "The unit is now a small team of humans and agents. And a community college student can be one of those humans."
The reframe higher ed keeps missing
The dominant conversation in higher ed about AI is still about cheating. Lagna thinks that's the wrong question. The right question, he argues, is why the assignment was something a chatbot could complete in the first place. "AI is not a threat to academic integrity," he says. "It is an indictment of academic assessment."
What he's done in The Crucible is change the path of least resistance. In a classic essay, the easiest move is to copy from a chatbot. In a Tollbooth Gem conversation, the easiest move is to think. The Gem refuses to give an answer. It pushes back on text that sounds suspiciously polished. It refers to something the student said three turns ago and asks them to connect it to what they're answering now. A determined student can still cheat; Lagna doesn't pretend otherwise. What he's done is make cheating more work than learning.
He’s found some validation for his Tollbooth design principle in his results with students, as well as in less traditional sources. He noted that the day before this episode was recorded, Pope Leo XIV released his first encyclical, Magnifica Humanitas. Chapter three addresses education in the age of AI. The line that stopped Lagna: that the desire to ask questions in young people must not be extinguished by perfect machines that make human thought seem useless.
That, Lagna said, is almost word-for-word how he approached building the Tollbooth. He built an agent that refuses to be perfect on purpose, so students keep asking questions.
To hear the full conversation, including the student who debunked an anti-vaccine influencer Gem with a metaphor about sandwiches, and what Lagna thinks higher ed is missing about agent engineering, listen to the new Educast 3000 episode.