Assessment

“It Would Be Cool to Make Up My Own Activities”: Youth Voice in STEM Teaching and Learning

Fostering youth voice means supporting young people in expressing their ideas, taking ownership of their learning, and engaging with their communities in meaningful and impactful ways. Out-of-school-time (OST) science, technology, engineering, and math (STEM) programs have long provided these opportunities, empowering youth to drive their learning forward and see themselves as active contributors to the world around them.

Author/Presenter

Victoria Oliveira

Virginia Andrews

Patricia J. Allen

Gil G. Noam

Year
2025
Short Description

For the promotion of youth voice to be successful, out-of-school-time (OST) program facilitators and classroom teachers need a common understanding of what quality looks and sounds like, along with support for implementing higher-quality instructional strategies. For well over a decade, the Dimensions of Success (DoS) observation system has provided such support in OST settings and, more recently, in middle-grade classrooms. In this article, we first demonstrate how DoS defines quality Youth Voice in OST and classroom settings through four vignettes based on observations of grade 5–8 classrooms and OST programs, then provide strategies educators can use to promote higher-quality Youth Voice by building on youth ideas and encouraging decision-making that drives STEM learning forward.

Culturally and Linguistically Sustaining Formative Assessment in Science and Engineering: Highlighting Multilingual Girls’ Linguistic, Epistemic, and Spatial Brilliances

Background
This study advances understandings of formative assessment by introducing the Culturally and Linguistically Sustaining Formative Assessment (CLSA) framework, grounded in relational and embodied perspectives and culturally sustaining pedagogy. While formative assessment is widely recognized as a process for supporting learning, less is known about how it can be enacted in culturally and linguistically sustaining ways.

Author/Presenter

Shakhnoza Kayumova

Akira Harper

Tia Madkins

Esma Nur Kahveci

Year
2025
Short Description

This study advances understandings of formative assessment by introducing the Culturally and Linguistically Sustaining Formative Assessment (CLSA) framework, grounded in relational and embodied perspectives and culturally sustaining pedagogy. While formative assessment is widely recognized as a process for supporting learning, less is known about how it can be enacted in culturally and linguistically sustaining ways.

Unveiling Scoring Processes: Dissecting the Differences Between LLMs and Human Graders in Automatic Scoring

Large language models (LLMs) have demonstrated strong potential for automatic scoring of constructed-response assessments. While human graders typically score constructed responses against given rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans, or whether it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs use to score students' written responses to science tasks and examines their alignment with human scores.

Author/Presenter

Xuansheng Wu

Padmaja Pravin Saraf

Gyeonggeon Lee

Ehsan Latif

Ninghao Liu

Xiaoming Zhai

Year
2025
Short Description

Large language models (LLMs) have demonstrated strong potential for automatic scoring of constructed-response assessments. While human graders typically score constructed responses against given rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans, or whether it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs use to score students' written responses to science tasks and examines their alignment with human scores. We also examine whether enhancing this alignment can improve scoring accuracy.
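
To make the alignment question concrete, here is a minimal sketch of scoring constructed responses with an LLM and checking agreement with human graders. This is not the authors' pipeline: the model name, prompt, rubric, and data are hypothetical placeholders, and the sketch assumes the `openai` Python client and scikit-learn are installed.

```python
# Sketch: LLM scoring of constructed responses vs. human grades.
# All names, prompts, and data below are illustrative, not from the paper.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score 0-2: 0 = no claim or evidence; 1 = claim with weak evidence; "
    "2 = claim supported by data and scientific reasoning."
)

def llm_score(response_text: str) -> int:
    """Ask the model for a single integer score under the rubric."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"You grade science responses.\n{RUBRIC}\n"
                        "Reply with only the integer score."},
            {"role": "user", "content": response_text},
        ],
    )
    return int(completion.choices[0].message.content.strip())

# Hypothetical student responses with human-assigned scores.
responses = [
    "The ice melted because heat flowed from the warm air into the ice.",
    "It just melted.",
    "Heat made it melt, I think, because the room was warm.",
    "Energy transferred to the ice warmed it to 0 C and then melted it.",
]
human = [2, 0, 1, 2]
machine = [llm_score(r) for r in responses]

# Quadratic-weighted kappa is a standard agreement statistic for ordinal scores.
print(cohen_kappa_score(human, machine, weights="quadratic"))
```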

Characterizing Teacher Knowledge Tests and Their Use in the Mathematics Education Literature

We present findings from an analysis of tests of teacher mathematical knowledge identified across 20 years of mathematics education literature. This analysis is part of a larger project aimed at developing a repository of instruments, together with their associated validity evidence, for use in mathematics education. We report on how these tests are discussed in the literature, with a focus on validity arguments and evidence. A key finding is that these tests are often presented in ways that do not support their use by the mathematics education community.

Author/Presenter

Pavneet Kaur Bharaj

Michele Carney

Heather Howell

Wendy M. Smith

James Smith

Year
2025
Short Description

We present findings from an analysis of tests of teacher mathematical knowledge identified across 20 years of mathematics education literature and report on how these tests are discussed in the literature, with a focus on validity arguments and evidence.

NLP-Enabled Automated Assessment of Scientific Explanations: Towards Eliminating Linguistic Discrimination

As the use of artificial intelligence (AI) has increased, concerns about AI bias and discrimination have grown. This paper discusses an application called PyrEval, in which natural language processing (NLP) was used to automate assessment and provide feedback on middle school science writing without linguistic discrimination. Linguistic discrimination in this study was operationalized as unfair assessment of scientific essays based on writing features that are not considered normative, such as subject-verb disagreement.

Author/Presenter

ChanMin Kim

Rebecca J. Passonneau

Eunseo Lee

Mahsa Sheikhi Karizaki

Dana Gnesdilow

Sadhana Puntambekar

Year
2025
Short Description

As the use of artificial intelligence (AI) has increased, concerns about AI bias and discrimination have grown. This paper discusses an application called PyrEval, in which natural language processing (NLP) was used to automate assessment and provide feedback on middle school science writing without linguistic discrimination.
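
One way to make the linguistic-discrimination check concrete is to compare an automated scorer's deviations from human scores for essays with and without non-normative surface features. The sketch below illustrates that idea generically; it is not PyrEval's code, and the residuals are hypothetical.

```python
# Sketch of a linguistic-fairness check for an automated scorer.
# If the scorer penalizes non-normative writing (e.g., subject-verb
# disagreement) rather than content, its score-minus-human residuals
# should skew negative for the non-normative group. Data are made up.
from scipy.stats import mannwhitneyu

residuals_normative = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
residuals_nonnormative = [-0.5, -0.4, -0.6, -0.2, -0.3, -0.4]

# One-sided test: are the non-normative group's residuals stochastically smaller?
stat, p = mannwhitneyu(
    residuals_nonnormative, residuals_normative, alternative="less"
)
print(f"U = {stat:.1f}, p = {p:.4f}")
```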

A Usability Analysis and Consequences of Testing Exploration of the Problem-Solving Measures–Computer-Adaptive Test

Testing is part of education around the world; however, there are concerns that the consequences of testing are underexplored in current educational scholarship. Moreover, usability studies are rare in education. One aim of the present study was to explore the usability of a mathematics problem-solving test called the Problem Solving Measures–Computer-Adaptive Test (PSM-CAT), designed for students in grades 6–8 (ages 11–14).

Author/Presenter

Sophie Grace King

Jonathan David Bostic

Toni A. May

Gregory E. Stone

Year
2025
Short Description

Testing is part of education around the world; however, there are concerns that the consequences of testing are underexplored in current educational scholarship. Moreover, usability studies are rare in education. One aim of the present study was to explore the usability of a mathematics problem-solving test called the Problem Solving Measures–Computer-Adaptive Test (PSM-CAT), designed for students in grades 6–8 (ages 11–14). The second aim of this mixed-methods research was to unpack consequences-of-testing validity evidence related to test results and interpretations, leveraging the voices of participants.
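
For readers unfamiliar with how a computer-adaptive test operates, the sketch below shows the generic select-administer-update loop under a Rasch (1PL) model. It is purely illustrative and not the PSM-CAT implementation; the item bank and the simulated examinee are hypothetical.

```python
# Sketch of a computer-adaptive test loop under the Rasch (1PL) model.
# Item difficulties and the simulated examinee are hypothetical.
import math
import random

item_bank = {"item1": -1.5, "item2": -0.5, "item3": 0.0,
             "item4": 0.7, "item5": 1.4}  # name -> difficulty (b)

def p_correct(theta: float, b: float) -> float:
    """Rasch probability that an examinee at ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta = 0.0        # provisional ability estimate
history = []       # (name, difficulty, correct) for administered items

for _ in range(4):
    used = {name for name, _, _ in history}
    # Select the unused item most informative at the current estimate:
    # under the Rasch model, that is the item with difficulty closest to theta.
    name, b = min(
        ((n, d) for n, d in item_bank.items() if n not in used),
        key=lambda nd: abs(nd[1] - theta),
    )
    correct = random.random() < p_correct(0.5, b)  # simulate true ability 0.5
    history.append((name, b, correct))
    # One Newton-Raphson step on the log-likelihood to update theta.
    score = sum(c - p_correct(theta, d) for _, d, c in history)
    info = sum(p_correct(theta, d) * (1 - p_correct(theta, d)) for _, d, _ in history)
    theta += score / info  # operational CATs usually cap this step size
    print(f"{name}: b={b:+.1f} correct={correct} theta={theta:+.2f}")
```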
