This article was published in February 2020 in ‘Significance’, the Royal Statistical Society journal.
In public discourse it has become common to claim that a programme or policy is “evidence informed”. Indeed, it is often felt sufficient merely to state that a particular decision is “evidence informed”, rather than describing the nature and quality of the underlying evidence.
The move to base public policy decisions on the best scientific evidence is certainly welcome and has been inspired and developed by initiatives such as the Cochrane Review in medicine and the Campbell Review in the social sciences, which rely upon systematic reviews of research. In this brief article I would like to explore the contemporary scene and how evidence is used or misused.
I will start by looking at what counts as evidence, followed by a discussion of how evidence should be presented, and examples of attempts to moderate ways in which evidence is used in public debate. Finally, I will look at how evidence about school performance has been used, and what lessons we might take from this.
But before considering the nature of evidence, it is worth saying that public policy decisions are, and should be, influenced by considerations beyond the research evidence, such as priorities, feasibility, acceptability, ethics and so forth: all of these will involve judgements, typically subjective ones.
Types, quality and uses of evidence
Evidence that can reasonably be termed “objective” can usefully be divided into two kinds. First, and most importantly, are inferences about causal or predictive relationships, usually across time, relating actions or social and other circumstances to later outcomes. Examples are climate change studies, educational studies of the comparative effects of different reading schemes for learners, and taxation policies designed to change behaviours such as tobacco smoking.
A second kind of evidence is useful, although secondary, and is concerned with the provenance of evidence: who has provided it, who has sponsored it, and what vested interests might be involved. Thus, there is legitimate concern about the funding of climate change research by oil companies, and for some time medical researchers studying the effects of smoking have refused to accept funding from the tobacco industry. While it could be argued that the quality of such research can be separated from its provenance, in practice this turns out to be difficult, with the possibility that subtle biases can exist in terms of assumptions made or populations studied, or that “unwelcome” results do not get published.
In addition to provenance, the general quality of research is typically supported by peer reviewing. The best journals will obtain at least two independent judgements on any paper submitted and, while not foolproof, this is perhaps the most satisfactory method that is currently available for weeding out unsatisfactory work. Nevertheless, the user of evidence can still usefully ask further questions about the way evidence is presented.
First, uncertainty should be taken into account and well communicated. A quantitative analysis of any dataset will always reflect uncertainty, due to sampling variability, errors in measuring instruments, the actual choice of the statistical model to be used, the assumptions made about distributions, and especially assumptions about independence. All of these need to be taken into account when interpreting results and making inferences about the real world, and preferably reported by the researchers together with sensitivity analyses studying the effects of changing the assumptions on the results.
Secondly, results should be presented in such a way as to allow an informed debate about how they can be used. Publicly accountable bodies – be they government offices or broadcasters – should be required to be transparent about the uncertainties and alternative explanations that might be involved. This raises the issue of how a balance might be achieved between opposing alternative interpretations, which is not simply to give any point of view equal status, but to judge whether any given view is valid. This is, of course, not easy, and the principle is more often than not honoured in the breach, as has often been the case with groups such as climate change deniers.
Finally, those people disseminating evidence – such as journalists and policymakers – need to resist the temptation to “cherry pick” the results they like, and this leads on to the issue of how to ensure that those using evidence do so in a responsible fashion.
Moderating the debate
Full Fact is perhaps the most well-known fact-checking organisation in the UK, and it has certainly been prepared to take on politicians who have been loose in their use of statistics. For example, Full Fact took Michael Gove to task for claiming (on the basis of cherry-picking) that, as a result of the actions and decisions he took while Secretary of State for Education, nearly 2 million more children were in “good” schools in 2019 than in 2010 (bit.ly/33TcQWB). Gove failed to take account of rising pupil numbers and changes that meant that only formerly “problematic” schools were inspected after 2010 and thus rated.
Work such as this should be commended, but Full Fact and similar sites often do not go into depth on an issue, nor do they usually indicate where an in-depth debate can be found. For example, Full Fact’s piece about how many children can reach an adequate standard in reading by age 11 does not query what is meant by the Department for Education’s (DfE) “expected” standard (bit.ly/3424Wdy). The DfE says that: “To reach the expected standard in each test subject, a pupil must achieve a scaled score of 100 or more.” This is an arbitrary definition that is difficult to standardise over time, yet there is no discussion of this key point.
Sites such as Full Fact do a good job with limited resources, but one consequence of the failure to delve in depth into subjects is that journalists – with their own limited resources – may simply use and quote the assessments provided by fact-checking sites, rather than seeing such sites as a first step to following up in more detail.
Journalists may look elsewhere, of course, including to the UK Statistics Authority (UKSA), the statutory body overseeing UK national statistics, which broadly does a good job of highlighting the misuse of statistics in public debate. But the UKSA’s resources are also limited and, like Full Fact and others, it does not generally explore issues in depth.
Another organisation that comments, criticises and advises on the use of evidence in public life is the Royal Statistical Society, but it too has limited resources, and must largely rely on voluntary input from members, though it nevertheless does a lot of insightful work, with the real strength that this work is informed by expert opinion.
And what of the experts, those who produce and publish so much of the research that ends up underpinning “evidence informed” policy? Before turning our attention to rankings and league tables, which have been a concern of mine for decades, we take a brief detour to the world of academia, where government policy and the changing nature of scientific publishing give rise to concerns about the use and presentation of evidence.
Impact and access
The first concern relates to the way in which “research impacts” are assessed and reported. The value of research into such things as social policy on the amelioration of poverty is generally accepted, but attempts to relate any specific research findings either to changes in policy or to changes in outcomes (such as numbers living in poverty) are fraught with difficulty. Sometimes, but not very often, it is possible to carry out experiments. But, in general, any claim for a causal relationship at the very least needs plausibly to rule out competing explanations involving confounders that might explain relationships across time.
However, current UK government policy devoted to the evaluation of university research – the Research Excellence Framework – explicitly encourages researchers to effectively ignore such best practice when describing their research “impact”, in favour of promoting their own research as a major driver of policy change or even change in a distal outcome (such as poverty alleviation). In general this is a pretty silly thing to do and ultimately could lead to a severe distortion of research and the ethics surrounding it, as well as forcing a concentration on short-term rather than long-term objectives. In some situations it may be possible to plausibly argue such a case – perhaps more so in the humanities and parts of the natural sciences. But, in general, it is not expected that researchers will try to make a case owing to the difficulty of the task.
The second concern is to do with the evolving economics of scientific publishing. Earlier, I mentioned the importance of who pays for the research that may produce evidence, but the role of commercial publishers of books and journals has also always been important. The most recent development in this field is so-called “open access publishing”, whereby the cost of accessing a research paper – which has traditionally fallen on the reader through access to an academic library or otherwise – is now being shifted to the writer of the paper, who might be expected to pay up to £2,000 to a journal so that the work can be freely downloaded by anybody. I do not have space to go into all the details, but it should be fairly clear that, under this model of publishing, those with the financial resources to pay so-called “article processing costs” are more likely to be those whose research gets read. The social, cultural and scientific implications of this are likely to be extensive.
We have also recently seen the steady growth of “middleperson” organisations who will publicise scientific work to the public for free – but charge the researcher up to £2,000 per paper – or, alternatively, they may offer to distribute a “popular” version provided by the researcher to paying subscribers.
Both of these examples are likely to change the balance of evidence that gets used, yet neither has been discussed in open debate.
How to use evidence sensibly
So, we have evidence. Now, how do we use it? I have spent much of my career arguing about league tables, especially school ones, so I’ll end on this topic. Over the last 30 years, some of us have had some success in conveying notions of statistical uncertainty (interval estimates for ranks) and the need to make adjustments for the differences in pupil intake between schools. These constraints have influenced policymakers to the extent that they are reflected in the tables they provide. But they have done little to moderate the enthusiasm of the media, who are generally unwilling to forsake the idea that what matters – or, perhaps, more cynically, what sells newspapers and website subscriptions – is a simple ranking of “best” to “worst” schools, without any concerns about uncertainty or even the need for statistical adjustment for intake. As George Leckie and I wrote in 2018 (bit.ly/379sPBQ): “[A]ccountability systems which choose to ignore pupil background are likely to reward and punish the wrong schools and this will likely have detrimental effects on pupil learning.”
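The idea of an interval estimate for a rank can be illustrated with a short simulation. The following is a minimal sketch, not the method used in any published league table: it assumes each school has an intake-adjusted effect that is approximately normally distributed with a known standard error, and that schools are independent. The function name `rank_intervals` and all of the numbers are hypothetical.

```python
import random


def rank_intervals(estimates, std_errors, n_draws=2000, level=0.95, seed=0):
    """Simulation-based interval estimates for ranks (rank 1 = highest estimate).

    Assumes each school's adjusted estimate is approximately normal with the
    given standard error, and that schools are independent of one another.
    """
    rng = random.Random(seed)
    n = len(estimates)
    simulated_ranks = [[] for _ in range(n)]
    for _ in range(n_draws):
        # Draw one plausible set of school effects, then rank the schools.
        draw = [rng.gauss(m, s) for m, s in zip(estimates, std_errors)]
        order = sorted(range(n), key=lambda i: draw[i], reverse=True)
        for rank, school in enumerate(order, start=1):
            simulated_ranks[school].append(rank)
    # Take symmetric percentiles of each school's simulated ranks.
    lo_idx = round((1 - level) / 2 * (n_draws - 1))
    hi_idx = n_draws - 1 - lo_idx
    intervals = []
    for ranks in simulated_ranks:
        ranks.sort()
        intervals.append((ranks[lo_idx], ranks[hi_idx]))
    return intervals


if __name__ == "__main__":
    # Hypothetical intake-adjusted effects and standard errors for five schools.
    effects = [0.40, 0.15, 0.05, -0.10, -0.35]
    errors = [0.20, 0.20, 0.20, 0.20, 0.20]
    for school, (low, high) in enumerate(rank_intervals(effects, errors), start=1):
        print(f"School {school}: rank between {low} and {high}")
```

When the standard errors are large relative to the differences between schools, the intervals span most of the table, which is the sense in which a simple “best” to “worst” ordering can mislead.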
In the late 1990s, work in Hampshire primary schools showed how school rankings could be used formatively for school improvement when test scores were properly adjusted for prior intake achievement and background factors. The basic idea was that instead of publishing such rankings, each school would have access to its own results and its relative position in the rankings and that these would form the basis of a constructive debate with school inspectors and school staff. Efforts such as this and in other areas, such as crime, attempting to show how evidence could be used to assist general improvement, were brought together in a monograph written by Beth Foley and me, published by the British Academy in 2012 (bit.ly/2KvfZEi).
The problem of school league tables illustrates several important points. It shows how certain kinds of evidence can be harmful if collected and then displayed in public. It shows, as in the case of university research impact assessments, that individual actors can game the system, so changing what it is intended to measure. It shows how a government can claim to be providing useful information, without any real attempt to stimulate a public debate about its usefulness. And it shows how mass media will embrace the most simplistic interpretations of the data without any encouragement to dig deeper.
To be clear: I am not advocating that we drop the idea of publicly accountable systems, rather that we move away from naïve and misleading presentations of evidence, and towards a more rational approach. In other words, league tables – for schools or other institutions – should function as one piece of evidence: as a screening device that may be able to point to concerns that could be followed up. But any ranking, in itself, is not suitable for use as a diagnostic tool to pass definitive judgement. (See “Rethinking school accountability” for a suggestion on how we might progress in education.)
So where does this leave us? I have little doubt that, ultimately, real evidence can win out if the issue is serious enough. For example, as we see with climate change evidence, it will be ignored for as long as possible by vested interests and those policymakers who rely upon such vested interests, until its implications really can no longer be ignored. Hopefully this will not be too late for useful action.
The important thing for researchers is not to give up. The research and the publicising of the implications of that research, along with public critiques of evidence abuse or suppression, need to continue. All of this is difficult, but I think there is an ethical imperative to try to do it.
And I hope to be involved in doing just that.
About the author
Harvey Goldstein is professor of social statistics at the University of Bristol Centre for Multilevel Modelling. He was awarded the Royal Statistical Society’s Guy Medal in Silver in 1998. He was elected a member of the International Statistical Institute in 1987, and a fellow of the British Academy in 1996. He was awarded an honorary doctorate by the Open University in 2001.
BOX: Rethinking school accountability
How might we change the system of school assessment and accountability, given the many problems associated with published school league tables? Here is one suggestion.
Instead of every pupil being tested (as is the case now in England at age 11, for example) only a sample within each school is tested. This would make any school comparisons even more uncertain but still, for example, maintain sufficient numbers to study larger groups (say, at a local authority level), as well as providing useful screening evidence for use within a sympathetic inspection system.
The sample data, which can be very rich (with rotated questions) and collected by external agencies, can be used for research that seeks to uncover relationships that can help in the understanding of school processes. This research purpose is important and provides an additional strong justification for moving away from the present system. In addition, by making the tests themselves public, they could be used by all schools, for their own internal purposes, to compare the achievements of pupils against nationally validated norms.
As for accountability, I think that perhaps we have all been too hung up on the length of the uncertainty intervals and whether they overlap for pairs of schools or are significantly different from the overall average. What we really need to do is to decide how many resources are available for following up the results of this initial screening and, on that basis, choose a threshold. For example, the threshold might be set at the lowest 10% of schools, and these schools would then be contacted to determine whether the results of the initial screen are a reflection of any underlying problems.
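The capacity-driven threshold described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `screen_schools`, the school names and the scores are hypothetical, and the scores are assumed to be already adjusted for intake. The number of schools that can be followed up is fixed first, and the threshold is then simply the score of the last school that fits within that capacity.

```python
def screen_schools(adjusted_scores, capacity):
    """Flag the lowest-scoring schools, up to the follow-up capacity.

    adjusted_scores: dict mapping school name to intake-adjusted score.
    capacity: number of schools that inspection resources can follow up.
    Returns (flagged school names, threshold score), lowest score first.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    ranked = sorted(adjusted_scores.items(), key=lambda item: item[1])
    flagged = ranked[:capacity]
    # The threshold is implied by capacity, not chosen by a significance test.
    threshold = flagged[-1][1] if flagged else None
    return [name for name, _ in flagged], threshold


if __name__ == "__main__":
    # Hypothetical adjusted scores for ten schools; capacity of one school
    # corresponds to following up the lowest 10%.
    scores = {
        "School A": 0.31, "School B": -0.42, "School C": 0.05,
        "School D": -0.12, "School E": 0.18, "School F": -0.03,
        "School G": 0.22, "School H": -0.27, "School I": 0.09,
        "School J": 0.01,
    }
    flagged, threshold = screen_schools(scores, capacity=1)
    print(flagged, threshold)
```

Being flagged here is a prompt for contact and further inquiry, in keeping with the screening role described above, and not a judgement on the school.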
We can, of course, have a series of thresholds determining the relative quantity of inspection resources to be allocated. In statistical terms, this becomes a problem in decision theory for which there are, at least in principle, known solutions. The same considerations can be applied to rankings computed for local authorities, or even those computed for groupings such as chains of schools operated by academy trusts.