Building Evidence into Education – why not look at the evidence?

Ben Goldacre’s recent paper ‘Building Evidence into Education’ has attracted a good deal of attention and debate (see for example here and here).

In it he argues that more experimental research should be undertaken in education, specifically randomized controlled trials (RCTs).  By implication he criticises educational researchers for not doing this already, and indeed at points in the paper he states that some researchers actively resist such developments (though he provides no examples).

Let’s be clear, much of what Goldacre states makes good sense. He reviews the strengths of RCTs; he argues that they should be used more widely in educational research; he notes the complementary strengths of qualitative research; and he argues that much more time and money should be devoted to producing and disseminating high quality research in education – creating an ‘information architecture’ as he puts it.

Why then do I find the paper so frustrating?  First, because Goldacre does not recognise or acknowledge the fact that educational researchers have been debating these issues for many years (along with most other social scientists). And secondly because the paper ends, as all these sorts of interventions tend to do, with a disciplinary ‘land grab’ for resources.  He concludes ‘We need academics with quantitative research skills from outside academic education departments – economists, demographers, and more, to come in and share their skills…’.  Oh yes, the economists, thank goodness for the economists, who have been so successful in modelling and developing our economy recently. Their RCTs have really helped with that.

In his own words, Goldacre’s paper is a ‘call to arms’. He sets up a rhetorical binary between educational research(ers) – ignorant, incompetent, uninterested in what might improve education – and proponents of randomized controlled trials – knowledgeable, skilled, only looking to identify what’s in the best interests of children.  While ostensibly criticising politicians (Mr. Gove?) for foisting too many untried and untested schemes on education, he plays to the same trope of positioning educators as the ‘enemies of promise’.

Large parts of the paper draw on examples from Medicine.  But given the ostensible focus on Education, might it not have been more useful to look at how these issues have been addressed in educational research over the years? The last time the issue was raised in the UK was probably David Hargreaves’s speech to the TTA in 1996 (Hargreaves 1996). This led (albeit indirectly) to a large programme of research initiated under the Teaching and Learning Research Programme.  The programme featured many mixed methods research designs, including some experimental designs (Torrance 2008). It was led by Andrew Pollard (cf. Pollard 2007), one of the National Curriculum expert panel so recently ignored by Mr. Gove.

However, debate in the field long predates this most recent manifestation. Campbell and Stanley’s (1963) classic contribution on ‘Experimental and Quasi-Experimental Designs’ reviews the problems and possibilities of developing RCTs in education in far more detail than Goldacre does, noting especially “the intransigence of the environment…that is the experimenter’s lack of complete control”.  In a text advocating experimental design, they nevertheless review threats to internal and external validity at great length and highlight the difficulties of running RCTs properly and effectively. In turn they acknowledge McCall’s (1923) ‘How to Experiment in Education’ and note that there have been regular periods of RCT advocacy and RCT disillusionment in educational research as the clear-cut results that RCTs promise have been unforthcoming.

And here’s the rub. The answers to questions of public policy and educational evaluation are often not very clear (nor indeed are the questions sometimes). More circumspect proponents of experimental methods than Goldacre acknowledge that in order for a causal relationship to be established, even within the narrow terms of an RCT, very specific questions have to be asked. In a collection of papers produced from a conference specifically convened to promote “Randomized Trials in Education Research”, Judith Gueron (2002) argues that while “random assignment . . . offers unique power in answering the ‘Does it make a difference?’ question” (p. 15), it is also the case that “[t]he key in large-scale projects is to answer a few questions well” (p. 40). In the same volume Thomas Cook and Monique Payne (2002) agree that

most randomized experiments test the influence of only a small subset of potential causes of an outcome, and often only one. . . . even at their most comprehensive, experiments can responsibly test only a modest number of the possible interactions between treatments. So, experiments are best when a causal question involves few variables [and] is sharply focused. (p. 152)

Thus RCTs can be very good at answering very specific questions. What they cannot do is produce the questions in the first place: that depends on much prior, often qualitative, investigation, not to mention value judgments about what is significant in the qualitative data and what is the nature of the problem to be addressed by a particular program intervention. Nor can RCTs provide an explanation of why something has happened. That will depend on much prior investigation and, if possible, parallel qualitative investigation of the phenomenon under study, to inform a developing analysis of what the researchers think may be happening.

Much of Goldacre’s paper is devoted to what RCTs have achieved in medicine. There is little acknowledgement of the differences between medical and educational research. There is virtually no reference to the long history of RCTs in education (i.e. the actual evidence in this field) and how often they result in ‘no significant difference’ being reported between control and experimental groups, even when problems of design and conduct of RCTs have been overcome (or, perhaps, because they haven’t). Goldacre states that ‘there have been huge numbers of trials in education in other countries, such as the US’ (again, by implication, castigating educational researchers in the UK), but says nothing about the lack of definitive results. In fact recent findings from the United States have been disappointing. Viadero, Education Week, 1 April 2009, reports: ‘Like a steady drip from a leaky faucet, the experimental studies being released this school year by the federal Institute of Education Sciences are mostly producing the same results: “No effects,” “No effects,” “No effects”.’

We should not be surprised. It was precisely the confounding problems of diverse implementation and interaction effects that produced so many “no significant difference” results in the 1960s in the context of the first wave of early childhood intervention and curriculum evaluation studies. Reflections on such results prompted the development and use of qualitative methods in evaluation studies in the 1970s and 1980s. Of course it might still be argued that it is important to know that something doesn’t work.  It can also be argued that this is how knowledge advances in science, especially the natural sciences: through the accumulation of many negative results before something significant appears to emerge. But Goldacre’s paper makes no reference to such complications – it simply assumes that RCTs will prove what does work, in a very straightforward manner.

Furthermore, the paper assumes that educational researchers are ignorant of RCTs, but as we have seen, this is not the case. Quite the reverse: educational researchers know all too well the pitfalls as well as the possibilities of RCTs, and are appropriately cautious about what they can achieve. While it might still be argued that undertaking more RCTs will benefit education, it cannot be argued, as Goldacre does in his opening paragraph, that this will provide ‘better evidence about what works best’. RCTs simply don’t provide that level of certainty.

Nor are even positive results easily generalised and disseminated to other contexts. Without a reasonable understanding of why particular outcomes have occurred, along with identifying the range of unintended consequences that will almost inevitably accompany an innovation, it is very difficult to generalise outcomes and implement the innovation with any degree of success elsewhere. A good example of this is provided by California’s attempt to implement smaller class sizes off the back of the apparent success of the Tennessee “STAR” evaluation. The Tennessee experiment examined the effects of smaller class size on student achievement, but worked with a sample of schools.  California attempted statewide implementation, producing more problems than it solved by creating teacher shortages, especially in poorer neighbourhoods in the state. There simply weren’t enough well-qualified teachers available to reduce class size statewide, and those that were tended to move to schools in richer neighbourhoods when more jobs in such schools became available (see Grissmer, Subotnik, & Orland, 2009).

RCTs might provide more evidence, different evidence, and, if properly funded and undertaken in the context of parallel, large scale, longitudinal, qualitative studies,  ‘better’ evidence of what works and why, for different groups in different contexts.  We certainly need more and better research. But ultimately this must be understood as providing a better resource for collaborative decision-making between researchers, teachers, students, parents and local authorities or clusters of schools.  It cannot define what ‘works best’.  There is no such thing in social action, across time, place and differing circumstances. To pretend otherwise is to assert the primacy of one particular research method over the provision of a wide range of different sorts of evidence to inform debate.

Replacing a system currently at the mercy of political whim, with a system driven by a narrow version of science, isn’t going to improve matters.  Let’s produce better evidence by all means, but we have to be appropriately modest about what research can achieve, and research has to develop in tandem with developing better forms of community engagement with our schools.

Prof. Harry Torrance


Campbell D. and Stanley J. (1963) ‘Experimental and Quasi-experimental Designs for Research on Teaching’ in Gage N. (Ed) Handbook of Research on Teaching Houghton Mifflin, Boston.

Cook, T., & Payne, M. (2002) ‘Objecting to the objections to using random assignment in educational research’ in F. Mosteller & R. Boruch (Eds.), Evidence matters: Randomized trials in education research (pp. 150–178). Washington, DC: Brookings Institution Press.

Goldacre B. (2013) Building Evidence into Education Department for Education, London, Available at:

Grissmer, D., Subotnik, R., & Orland, M. (2009). A Guide to incorporating multiple methods in randomized controlled trials to  assess intervention effects. Available at

Gueron, J. (2002) ‘The politics of random assignment: Implementing studies and affecting policy’ in F. Mosteller & R. Boruch (Eds.), Evidence matters: randomized trials in education research (pp. 15–49). Washington, DC: Brookings Institution Press.

Hargreaves, D. (1996). Teaching as a research-based profession. Teacher Training Agency 1996 Annual Lecture. London: Teacher Training Agency.

McCall W. (1923) How to Experiment in Education, New York, MacMillan

Pollard, A. (2007) The UK’s Teaching and Learning Research Programme:  findings and significance British Educational Research Journal 33, 5, 639-646

Torrance, H. (2008). Overview of ESRC research in education: A consultancy commissioned by ESRC: Final report. Available at

Viadero, D. (2009, April 1). “No effects” studies raising eyebrows. Education Week. Available at–NoEffectsArticle.pdf

11 thoughts on “Building Evidence into Education – why not look at the evidence?”

  1. Interesting riposte, but I am no clearer how research in education might provide any answers. I would prefer more ‘no effects’ from RCTs than misplaced subjective and qualitative outcomes from other study designs. RCTs like peer review = least worst option?

  2. Maybe that should have been written: I might prefer more ‘no effects’.
    Anyway, another thought: as I read this, the failure of the CA smaller class size initiative is not an indictment of RCTs per se, rather of failing to provide sufficient resources for the program.

  3. Thanks Paul – fair points – though I wouldn’t agree that qualitative work is ‘misplaced’. I’m not sure what you mean by that and really don’t understand the antagonism. In limited space I guess all I can do is point again to my final three paragraphs. By all means let’s design more trials, but not to the exclusion of other research, and not in the belief that they will straightforwardly tell us what ‘works best’. Evidence has to be interpreted and acted upon in situ, at local level. Your observation about the California case demonstrates this.

    • Hi Harry, sure ‘misplaced’ may be the wrong word, ‘misguided’ or ‘low quality’ is perhaps what I meant. I am not antagonized at all. I am really interested in this debate. I am a biochemist and think that I understand the ‘power’ of RCTs. Your essay does not seem to hold out much hope that any question can be answered by any alternative, hence my ‘least worst option’ comment. BTW my interest arises in part because I came across a similar issue in relation to my role as a school governor:

      The school is obliged to report the ‘impact’ of the Pupil Premium funding. I have the concern that this is not properly measurable as there is no control group — a group of matched pupils that has not received the PP funding — except a group that is in another school year and has had different teachers, etc. (i.e. not properly matched). To me this makes a nonsense of stating the ‘impact’. Sure, the progression and attainment of poorer students may improve over time, but it will not be definitive that such improvement is down to PP.

  4. Hi Paul, qualitative work can answer many questions, but perhaps not every question in the way you would want (‘any question’). The example of PP is an interesting one. A lot depends on what one means by ‘impact’ – narrowly or broadly defined? Such funding is unlikely to have ‘no effects’, though it is quite likely that what effects there are cannot be detected (as significant) by an RCT – even if you could run one, which you clearly can’t. Thus evidence should be collected by other means. We’re straying into ‘evaluation’ rather than ‘research’ now, but if the purpose is to provide information for decision-making, then some evidence is better than no evidence, and impact on available teaching resources, home-school communication, teacher morale, pupil motivation, etc. are all important to know about in coming to decisions about how to use the funding and whether or not to continue it. Happy to correspond further by email if you wish.

  5. I have an awful lot of respect for Ben Goldacre, but I am also wary of experts moving away from their specialty. Pons and Fleischmann’s work on cold fusion and Pauling and Vitamin C would be two specific examples. With this in mind, I’d like to throw this piece I wrote some time ago into the debate:

    Within science, we are used to a solution space with one or a few optimal solutions. We believe in the concept of an objective truth which can, in principle, for the most part, be established by a series of reproducible experimental tests.
    Education is different, it’s like art: There is no best picture, there are a load of different pictures, some of which need a lot of theory to make them work, some of which just look good. No picture appeals to everyone, though some of them can be condemned by just about everyone as being a load of crap. And, at the high end there’s a specialized group of art critics and artists who practice in little specialized bubbles where they play glass bead games with each other.
    It’s not a science, and we don’t have educational engineering as a specialty yet.

    That said, I now return to the issue. Ben G. may be forgiven (we don’t have to) for having missed the good randomized experimental work which has been done. There are so many studies carried out on low sample sizes, with no control group and no attempt to even pretend to objectivity in the results, that it’s hard to believe that there may be any good stuff in there.

    Many educational researchers play an enabling dysfunctional role in promoting this all-pervading background noise in the research environment. I once criticized a quantitative study based on the researcher’s six graduate students on the basis that the sample size was too small to warrant the extensive computer-based statistical analysis of the four structured interviews that were completed, and I was told that my argument was just statistics, and so it didn’t count.
    As a group, educational researchers are often quite poor at discerning the confirmatory effect of observer bias and confounding it with high quality qualitative or mixed methods research.

    I’m making two points here 1) In some senses our field is closer to art than a science so objective criteria may be harder to use, and 2) There is a lot of bad quantitative work out there which makes it difficult to see the good.

  6. Pingback: ‘Interested Amateurs’ and Educational Research | ESRI Blog

  7. Pingback: Another Year, Another BERA | ESRI Blog

  8. Pingback: Celeb Youth » When policymakers and academics collide OR evidence is in the eye of the beholder
