Assessment

“It Would Be Cool to Make Up My Own Activities”: Youth Voice in STEM Teaching and Learning

Fostering youth voice means supporting young people in expressing their ideas, taking ownership of their learning, and engaging with their communities in meaningful and impactful ways. Out-of-school-time (OST) science, technology, engineering, and math (STEM) programs have long provided these opportunities, empowering youth to drive their learning forward and see themselves as active contributors to the world around them.

Author/Presenter

Victoria Oliveira

Virginia Andrews

Patricia J. Allen

Gil G. Noam

Year
2025
Short Description

For the promotion of youth voice to be successful, out-of-school-time (OST) program facilitators and classroom teachers need a common understanding of what quality looks and sounds like, along with support for implementing higher-quality instructional strategies. For well over a decade, the Dimensions of Success (DoS) observation system has provided such support in OST settings and, more recently, in middle-grade classrooms. In this article, we first demonstrate how DoS defines quality Youth Voice in OST and classroom settings through four vignettes based on observations of grade 5–8 classrooms and OST programs, then provide strategies educators can use to promote higher-quality Youth Voice by building on youth ideas and encouraging decision-making that drives STEM learning forward.

Culturally and Linguistically Sustaining Formative Assessment in Science and Engineering: Highlighting Multilingual Girls’ Linguistic, Epistemic, and Spatial Brilliances

Background
This study advances understandings of formative assessment by introducing the Culturally and Linguistically Sustaining Formative Assessment (CLSA) framework, grounded in relational and embodied perspectives and culturally sustaining pedagogy. While formative assessment is widely recognized as a process for supporting learning, less is known about how it can be enacted in culturally and linguistically sustaining ways.

Author/Presenter

Shakhnoza Kayumova

Akira Harper

Tia Madkins

Esma Nur Kahveci

Year
2025
Short Description

This study advances understandings of formative assessment by introducing the Culturally and Linguistically Sustaining Formative Assessment (CLSA) framework, grounded in relational and embodied perspectives and culturally sustaining pedagogy. While formative assessment is widely recognized as a process for supporting learning, less is known about how it can be enacted in culturally and linguistically sustaining ways.

Unveiling Scoring Processes: Dissecting the Differences Between LLMs and Human Graders in Automatic Scoring

Large language models (LLMs) have demonstrated strong potential for automatic scoring of constructed-response assessments. While human graders typically score constructed responses against given rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans, or whether it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs use to score students' written responses to science tasks and examines their alignment with human scores.

Author/Presenter

Xuansheng Wu

Padmaja Pravin Saraf

Gyeonggeon Lee

Ehsan Latif

Ninghao Liu

Xiaoming Zhai

Year
2025
Short Description

Large language models (LLMs) have demonstrated strong potential for automatic scoring of constructed-response assessments. While human graders typically score constructed responses against given rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans, or whether it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs use to score students' written responses to science tasks and examines their alignment with human scores. We also examine whether enhancing this alignment can improve scoring accuracy.
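
To make the alignment question concrete, here is a minimal sketch of scoring constructed responses with an LLM and checking agreement with human graders. This is not the authors' pipeline: the model name, prompt, rubric, and data are hypothetical placeholders, and the sketch assumes the `openai` Python client and scikit-learn are installed.

```python
# Sketch: LLM scoring of constructed responses vs. human grades.
# All names, prompts, and data below are illustrative, not from the paper.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score 0-2: 0 = no claim or evidence; 1 = claim with weak evidence; "
    "2 = claim supported by data and scientific reasoning."
)

def llm_score(response_text: str) -> int:
    """Ask the model for a single integer score under the rubric."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"You grade science responses.\n{RUBRIC}\n"
                        "Reply with only the integer score."},
            {"role": "user", "content": response_text},
        ],
    )
    return int(completion.choices[0].message.content.strip())

# Hypothetical student responses with human-assigned scores.
responses = [
    "The ice melted because heat flowed from the warm air into the ice.",
    "It just melted.",
    "Heat made it melt, I think, because the room was warm.",
    "Energy transferred to the ice warmed it to 0 C and then melted it.",
]
human = [2, 0, 1, 2]
machine = [llm_score(r) for r in responses]

# Quadratic-weighted kappa is a standard agreement statistic for ordinal scores.
print(cohen_kappa_score(human, machine, weights="quadratic"))
```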

Characterizing Teacher Knowledge Tests and Their Use in the Mathematics Education Literature

We present findings from an analysis of tests of teacher mathematical knowledge identified across 20 years of mathematics education literature. This analysis is part of a larger project aimed at developing a repository of instruments, together with their associated validity evidence, for use in mathematics education. We report on how these tests are discussed in the literature, with a focus on validity arguments and evidence. A key finding is that these tests are often presented in ways that do not support their use by the mathematics education community.

Author/Presenter

Pavneet Kaur Bharaj

Michele Carney

Heather Howell

Wendy M. Smith

James Smith

Year
2025
Short Description

We present findings from an analysis of tests of teacher mathematical knowledge identified across 20 years of mathematics education literature and report on how these tests are discussed in the literature, with a focus on validity arguments and evidence.

NLP-Enabled Automated Assessment of Scientific Explanations: Towards Eliminating Linguistic Discrimination

As the use of artificial intelligence (AI) has increased, concerns about AI bias and discrimination have grown. This paper discusses an application called PyrEval, in which natural language processing (NLP) was used to automate assessment and provide feedback on middle school science writing without linguistic discrimination. Linguistic discrimination in this study was operationalized as unfair assessment of scientific essays based on writing features that are not considered normative, such as subject-verb disagreement.

Author/Presenter

ChanMin Kim

Rebecca J. Passonneau

Eunseo Lee

Mahsa Sheikhi Karizaki

Dana Gnesdilow

Sadhana Puntambekar

Year
2025
Short Description

As the use of artificial intelligence (AI) has increased, concerns about AI bias and discrimination have grown. This paper discusses an application called PyrEval, in which natural language processing (NLP) was used to automate assessment and provide feedback on middle school science writing without linguistic discrimination.
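
One way to make the linguistic-discrimination check concrete is to compare an automated scorer's deviations from human scores for essays with and without non-normative surface features. The sketch below illustrates that idea generically; it is not PyrEval's code, and the residuals are hypothetical.

```python
# Sketch of a linguistic-fairness check for an automated scorer.
# If the scorer penalizes non-normative writing (e.g., subject-verb
# disagreement) rather than content, its score-minus-human residuals
# should skew negative for the non-normative group. Data are made up.
from scipy.stats import mannwhitneyu

residuals_normative = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
residuals_nonnormative = [-0.5, -0.4, -0.6, -0.2, -0.3, -0.4]

# One-sided test: are the non-normative group's residuals stochastically smaller?
stat, p = mannwhitneyu(
    residuals_nonnormative, residuals_normative, alternative="less"
)
print(f"U = {stat:.1f}, p = {p:.4f}")
```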

A Usability Analysis and Consequences of Testing Exploration of the Problem-Solving Measures–Computer-Adaptive Test

Testing is part of education around the world; however, there are concerns that the consequences of testing are underexplored in current educational scholarship. Moreover, usability studies are rare in education. One aim of the present study was to explore the usability of a mathematics problem-solving test called the Problem Solving Measures–Computer-Adaptive Test (PSM-CAT), designed for students in grades 6–8 (ages 11–14).

Author/Presenter

Sophie Grace King

Jonathan David Bostic

Toni A. May

Gregory E. Stone

Year
2025
Short Description

Testing is part of education around the world; however, there are concerns that the consequences of testing are underexplored in current educational scholarship. Moreover, usability studies are rare in education. One aim of the present study was to explore the usability of a mathematics problem-solving test called the Problem Solving Measures–Computer-Adaptive Test (PSM-CAT), designed for students in grades 6–8 (ages 11–14). The second aim of this mixed-methods research was to unpack consequences-of-testing validity evidence related to test results and interpretations, leveraging the voices of participants.
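
For readers unfamiliar with how a computer-adaptive test operates, the sketch below shows the generic select-administer-update loop under a Rasch (1PL) model. It is purely illustrative and not the PSM-CAT implementation; the item bank and the simulated examinee are hypothetical.

```python
# Sketch of a computer-adaptive test loop under the Rasch (1PL) model.
# Item difficulties and the simulated examinee are hypothetical.
import math
import random

item_bank = {"item1": -1.5, "item2": -0.5, "item3": 0.0,
             "item4": 0.7, "item5": 1.4}  # name -> difficulty (b)

def p_correct(theta: float, b: float) -> float:
    """Rasch probability that an examinee at ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta = 0.0        # provisional ability estimate
history = []       # (name, difficulty, correct) for administered items

for _ in range(4):
    used = {name for name, _, _ in history}
    # Select the unused item most informative at the current estimate:
    # under the Rasch model, that is the item with difficulty closest to theta.
    name, b = min(
        ((n, d) for n, d in item_bank.items() if n not in used),
        key=lambda nd: abs(nd[1] - theta),
    )
    correct = random.random() < p_correct(0.5, b)  # simulate true ability 0.5
    history.append((name, b, correct))
    # One Newton-Raphson step on the log-likelihood to update theta.
    score = sum(c - p_correct(theta, d) for _, d, c in history)
    info = sum(p_correct(theta, d) * (1 - p_correct(theta, d)) for _, d, _ in history)
    theta += score / info  # operational CATs usually cap this step size
    print(f"{name}: b={b:+.1f} correct={correct} theta={theta:+.2f}")
```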
