What do student evaluations indicate about learning, and how should teaching be evaluated?
Existing studies of the correlation between student ratings and student learning report mixed results and, understandably, involve many confounding factors. Because student ratings are currently the main method for evaluating many higher education faculty, I share some studies below that question this dominant use of student ratings.
The best-controlled study I've seen so far of longer-term learning (Carrell & West, 2010, Journal of Political Economy) finds a negative correlation between student evaluations and "deep learning" (defined as understanding that helps students do better in subsequent courses). Stark & Freishtat (2014) provide further evidence for placing less emphasis on student evaluations.
In physics in particular, Eric Mazur found that student evaluations decreased while learning increased when he integrated active learning into his courses (Crouch, Watkins, Fagen & Mazur, 2007, Research-Based Reform of University Physics): "...student evaluations and attitude are not a measure of student learning; ... we saw high learning gains for the students in the algebra-based course in spite of lower perceived satisfaction overall. Other instructors report similar experiences. Furthermore, research indicates that student evaluations are based heavily on instructor personality rather than course effectiveness."
If student evaluations of instructors don't necessarily measure learning, what do they measure? Drawing on various studies, Stark & Freishtat (2014) identify several factors that correlate with student ratings of instructors, though there is little consensus on any of them:
student grade expectations
first impressions and physical attractiveness (student evaluations can be predicted from students' reactions to 30 seconds of silent video of the instructor)
gender, ethnicity and age
What can we use instead of, or to supplement, student evaluations?
Many institutions, such as the University of Michigan, recommend a more holistic assessment based on a combination of student ratings (with a response rate of at least 50%), peer ratings, an instructor portfolio, and other supporting materials (see figure below from Felder and Brent, 2004).
One promising tool for peer assessment is the Reformed Teaching Observation Protocol, which lists 25 criteria ("subscale items" in link) found to improve teaching.