Theory-based evaluations within a counterfactual paradigm
Interest in theory-based evaluation has grown significantly over the last decade, and central government commissioners more frequently identify it as the anticipated or preferred approach in invitations to tender. Theory-based evaluations are generally seen as those that explore the ‘causal chains’ within interventions, often using qualitative methods, and that assess whether an intervention has been effective without necessarily using a comparison group. This contrasts with counterfactual approaches, particularly randomised controlled trials (RCTs), which use statistical methods and are centred on the identification or creation of comparison groups. This has led to the perception – and to some extent the reality – of two camps of researchers: the ‘randomistas’ versus those who question the appropriateness and usefulness of RCTs and counterfactual approaches to evaluating social policy.
The case against counterfactual evaluation methods
The Magenta Book, HM Treasury’s guide to evaluation, takes a ‘toolbox’ approach to methods, suggesting that different methods are useful in different contexts. Of theory-based impact evaluations, it states that they ‘tend to be particularly suited for the evaluation of complex interventions or simple interventions in complex environments’ (HM Treasury, 2020). In stark contrast, key proponents of theory-based methods question the fundamental suitability of counterfactual approaches to evaluation within social policy. Nick Tilley, one of the authors of the influential book Realistic Evaluation (Pawson and Tilley, 1997), summarised their position by saying:
Ray Pawson and I are highly sceptical of… [traditional approaches to] experimentation. We are doubtful of this as a method of finding out which programmes do and do not produce the intended and unintended consequences. We do not believe it to be a sound way of deriving sensible lessons for policy and practice. (Tilley, 2000)
Central to their critique of RCTs is the observation that trials of the same intervention produce different results in different locations (ibid) and thus do not provide the definitive assessment of impact their proponents claim. It is important to note that, unlike the Magenta Book, Tilley is objecting to counterfactual impact evaluation in principle, not just saying that different methods are appropriate in different contexts or for different types of intervention.
John Mayne, the originator of contribution analysis, likewise rejects the use of counterfactual impact evaluation in principle. His argument rests less on the perceived methodological failures of counterfactual approaches and more on the nature of social policy interventions themselves, and on his belief that traditional evaluation methods cannot reflect their complexity. Evaluating them appropriately, he argues, requires a different conception of causality, one he labels ‘generative causality’ (Mayne, 2020).
Contribution analysis aims at arriving at credible claims on the intervention as a contributory cause… [using] the generative perspective on causality to assess whether the intervention has ‘made a difference’. Made a difference in this context means that the intervention had a positive impact on peoples’ lives, that is, it made a contribution. It played a causal role. In most settings, on its own, the intervention would not make a difference. It is the associated package of factors that makes a difference. This interpretation of making a difference needs to be distinguished from the meaning associated with the counterfactual perspective on causality, where made a difference means ‘what would not have happened without the intervention’. (Ibid, italics in the original)
Thus, embedded within two of the leading approaches to theory-based evaluation is not just the idea that they are useful in some circumstances when counterfactual approaches are not feasible or acceptable, but that they are fundamentally alternative approaches to impact evaluation and by implication are preferable in most or all circumstances.
The strengths and limitations of counterfactual approaches
The criticisms of counterfactual approaches to evaluation advanced by Tilley and Mayne seem flawed to me. Tilley’s argument that RCTs cannot be trusted because they can produce different results is circular: it is only because they generate internally valid findings that we can conclude that the impact of the intervention differs across contexts. In effect, Tilley’s argument that RCTs are not a valid approach to evaluating social interventions rests on the validity of the very approach he is saying is invalid. Mayne’s argument also seems logically incorrect. Even if it is true that an intervention is only a contributing cause and would not make a difference on its own, this does not imply that counterfactual approaches are inappropriate. There are many examples of counterfactual approaches identifying exactly this issue. For example, using counterfactual techniques, researchers identified a gene that increases the risk of depression, but only when a number of life stressors are present (Caspi et al, 2003). In general, this kind of causal relationship belongs to the class of phenomena known as ‘interaction effects’, which can be explored through quasi-experimental statistical techniques and multi-armed RCTs, as the sketch below illustrates.
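To make the point concrete, here is a minimal simulation sketch (entirely hypothetical data and variable names, not a reanalysis of Caspi et al): an intervention whose effect only materialises in the presence of a second factor, recovered by including an interaction term in a standard regression.

```python
# Hypothetical simulation: an intervention that 'makes a difference'
# only in combination with another factor, detected via an
# interaction term. All numbers and names are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 10_000
treated = rng.integers(0, 2, n)    # randomly assigned intervention
stressor = rng.integers(0, 2, n)   # contextual factor, e.g. a life stressor
# The outcome only shifts when treatment and stressor co-occur
outcome = 1.0 * treated * stressor + rng.normal(0, 1, n)

df = pd.DataFrame({"outcome": outcome, "treated": treated, "stressor": stressor})
model = smf.ols("outcome ~ treated * stressor", data=df).fit()
print(model.params)
# The 'treated' coefficient alone is ~0; the 'treated:stressor'
# interaction is ~1.0, i.e. the counterfactual framework has
# identified a contributory cause.
```

The same logic extends to multi-armed RCTs, where the arms correspond to the cells of the interaction.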
While Tilley and Mayne’s specific criticisms of RCTs seem problematic, the more general argument that counterfactual approaches have limitations is undeniably true. In fact, it is built into their underlying logic. The Rubin causal model (Imbens and Rubin, 2010) is based on the theoretical construct of ‘potential outcomes’. Within this model, impact is defined as the outcome experienced by an individual having received an intervention compared to the outcome the same individual would have experienced had they not received it (the counterfactual outcome). By definition, the counterfactual is not observable; it exists only as a ‘potential’, since any single individual cannot both experience and not experience the intervention at the same point in time. Therefore, all counterfactual evaluation approaches are efforts to estimate this non-observable counterfactual and are by definition imperfect. They do so by looking at groups of individuals, in the hope that on average any differences between the groups will be due purely to chance and can therefore be accounted for using statistical techniques. Within this framework, RCTs are seen as the best way to approximate the counterfactual because they are the approach most likely to eliminate systematic differences between the intervention and control groups, so that any average difference will be due solely to the intervention. Quasi-experimental approaches look for ‘as-good-as-random’ equivalents, often combined with matching techniques, in an attempt to achieve the same result.
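The logic of the paragraph above can be stated compactly in the standard potential-outcomes notation (a minimal sketch following Imbens and Rubin, 2010):

```latex
% Y_i(1), Y_i(0): potential outcomes for individual i with and
% without the intervention; T is the treatment indicator.
\tau_i = Y_i(1) - Y_i(0)
% The individual effect \tau_i is never observed: only one of
% Y_i(1), Y_i(0) is realised for any individual.
\text{ATE} = \mathbb{E}\left[\, Y(1) - Y(0) \,\right]
% Under randomisation, T is independent of (Y(0), Y(1)), so the
% observable difference in group means identifies the ATE:
\mathbb{E}\left[\, Y \mid T = 1 \,\right] - \mathbb{E}\left[\, Y \mid T = 0 \,\right] = \text{ATE}
```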
The fact that even the most robust counterfactual approaches are only imperfect attempts to estimate the impact of an intervention means that even perfectly implemented RCTs have substantial limitations. These include self-selection into the trial (limiting generalisability), spillover effects, equilibrium effects, experiment effects (such as people acting differently because they are being observed), and context effects (trials are undertaken within particular contexts, which will influence the outcomes). The issues with quasi-experimental methods are compounded by the fact that they often rest on assumptions that cannot be tested. These limitations mean that counterfactual evaluations are not definitive and therefore benefit from many of the approaches and techniques advocated by theory-based evaluators to understand and interpret their findings. In this sense, all evaluations are, or should be, theory-based whether or not they are framed that way, because explicitly or implicitly they all rely on theory to understand the results.
However, the fact that counterfactual approaches have limitations is not a convincing argument for abandoning them. Nor is it a good argument for replacing them with purely theory-based approaches, not least because the methods those approaches advocate have even greater limitations for assessing impact. Qualitative methods (one of the central techniques used) can provide exceptionally rich and insightful data, but there are fundamental risks in relying on people’s self-assessments of what has caused an outcome, because research provides compelling evidence that people’s self-attributions can be limited and incorrect. One classic example is an experiment in which varying product placement changed buying decisions, yet participants denied that the placement had anything to do with their choices. More generally, people have a limited perspective and are subject to a whole range of biases, which means qualitative evidence cannot be relied on to provide robust evidence of impact (Matute et al, 2015).
A second problem relates to numbers. Looking for average effects is not the primary reason for using counterfactual approaches; the primary reason is that there is no counterfactual for an individual. But within a policy context the average effect of an intervention matters. Even if a theory-based approach could reliably attribute an outcome to an intervention for some individuals, it matters whether this happens on average, because an intervention that only helps a small number of people, or helps some people but harms others, is not necessarily cost-effective or ethical. And because of the intensive nature of theory-based approaches, particularly those relying heavily on qualitative research, they will often struggle to say what happens on average.
Evidence hierarchies and theory-based evaluation
Evidence hierarchies take into account the limitations of counterfactual evaluations described above. For example, the highest rung on the Maryland Scientific Methods Scale requires an intervention to have been found successful in ‘multiple rigorous evaluations’ (Early Intervention Foundation, 2023), reflecting Tilley’s observation that RCTs in different locations may produce different results. Similarly, systematic reviews, such as those undertaken for What Works centres, take into account the fact that single studies do not provide definitive answers and aim to provide an assessment of evidence quality rather than simply stating effect sizes. For example, the Youth Endowment Fund Toolkit uses four criteria to rate confidence in the evidence base for particular types of intervention: the number of studies available, confidence in the methodology, consistency of effect sizes, and the type of outcome measure (Gaffney et al, 2021).
The lowest evidence rating within these kinds of evidence scales typically requires studies that include:
Either (a) a cross-sectional comparison of treated groups with untreated groups, or (b) a before-and-after comparison of treated group, without an untreated comparison group. No use of control variables in statistical analysis to adjust for differences between treated and untreated groups or periods. (What Works for Local Economic Growth, 2016)
Whereas the second lowest level requires:
Use of adequate control variables and either (a) a cross-sectional comparison of treated groups with untreated groups, or (b) a before-and-after comparison of treated group, without an untreated comparison group. In (a), control variables or matching techniques used to account for cross-sectional differences between treated and controls groups. In (b), control variables are used to account for before-and after changes in macro level factors. (Ibid)
This means that theory of change evaluations do not reach the standard required for the second level and may or may not reach the standard for the lowest level. I think this is a fair assessment of the robustness of theory of change evaluations in terms of impact evaluation. This is of course frustrating if it is not possible or feasible to undertake an evaluation that would rank more highly on the evidence scale. But the fact that it is difficult or impossible to undertake a counterfactual evaluation does not invalidate the logic of what is needed to establish causality.
Within an academic setting, one might be tempted simply to conclude that theory of change evaluations do not produce robust evidence of impact and therefore urge those designing evaluations to use counterfactual approaches wherever possible. However, in a policy environment – as in life generally – people have to make decisions based on imperfect evidence. Counterfactual approaches may not be possible (as with macroeconomic decisions), they may not be ethical or acceptable (as in assessing the effectiveness of children’s social care services), or there may be policy reasons, such as manifesto commitments, that mean an intervention is introduced in a way that precludes identifying a meaningful comparison group. In these cases, theory of change evaluations can enable decision-makers to make the most informed choice in the absence of robust counterfactual studies.
Implementing theory-based evaluation
In many ways, theory of change evaluations follow the same steps as all impact evaluations:
1. Identify the key research questions
2. Develop robust theories of change
3. Gather evidence about the impact of the intervention
4. Review and assess the evidence gathered
5. Reach a conclusion about whether and to what degree the intervention has worked
6. Assess the strength of the evidence and describe its limitations
Theory of change evaluations obviously differ from counterfactual ones at step 3, as the process of gathering evidence about impact will not generally involve an RCT or a quasi-experimental approach. However, they also differ in that theory of change evaluations require a theory of change, whereas counterfactual approaches can, in theory, be implemented without one. This is not something I’d recommend (for the reasons discussed above), but counterfactual evaluations can in principle be ‘black box’ evaluations that simply measure specified outcomes without any clear idea about how an intervention is meant to work.
The crucial difference from counterfactual evaluations lies in assessing the degree to which the data provide evidence of impact. Process tracing uses four ‘tests’ (straw-in-the-wind, hoop, smoking-gun, and doubly decisive) to make this assessment (Collier, 2011). The problem is that these tests imply that logical operators can substitute for empirical evidence: because the analyst says a piece of evidence is ‘doubly decisive’, it therefore provides strong evidence of causality. It is a sleight of hand that distracts from the central question of why a particular piece of evidence is ‘doubly decisive’ and which pieces of evidence count as such. In effect, these ‘tests’ are just descriptive labels for an evaluator’s individual judgement and fail to provide a robust framework for making those judgements. To fill that gap, I think it is useful to consider the following elements needed to establish and assess causation within the Rubin causal model; the role of the analyst is then to provide evidence against each element and an argument for its robustness.
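Collier’s four tests amount to a two-by-two classification of evidence by whether passing the test is necessary and/or sufficient for the causal claim. The sketch below (hypothetical labels and evidence, not drawn from any real evaluation) makes that structure explicit, and also shows where the subjectivity enters: everything hinges on which test the analyst assigns to a given piece of evidence.

```python
# Collier's (2011) four process-tracing tests as a
# necessity/sufficiency grid. The assignment of evidence to a test
# is itself an analyst judgement, which is the point made above.
TESTS = {
    "straw-in-the-wind": {"necessary": False, "sufficient": False},
    "hoop":              {"necessary": True,  "sufficient": False},
    "smoking-gun":       {"necessary": False, "sufficient": True},
    "doubly-decisive":   {"necessary": True,  "sufficient": True},
}

def assess(evidence):
    """evidence: list of (test_name, passed) pairs, as classified by the analyst."""
    for test, passed in evidence:
        if TESTS[test]["necessary"] and not passed:
            return "hypothesis eliminated (failed a necessary test)"
        if TESTS[test]["sufficient"] and passed:
            return "hypothesis confirmed (passed a sufficient test)"
    return "hypothesis neither confirmed nor eliminated"

# Reclassifying the second item from 'straw-in-the-wind' to
# 'smoking-gun' flips the conclusion, with no new data at all.
print(assess([("hoop", True), ("straw-in-the-wind", True)]))
print(assess([("hoop", True), ("smoking-gun", True)]))
```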
Change in outcome – an essential part of an impact evaluation is evidence that the relevant outcome has changed. This is not as obvious as it sounds, as evaluations sometimes do not measure outcomes at all. Communication campaigns often measure things such as website hit rates or changes in awareness or perceptions, but not the behaviour that has actually been targeted (for example Twitch’s anti-drink-driving campaign ‘All good. All bad’, Awards Analyst, 2022). Behaviour can be measured through self-reports (using qualitative or quantitative data) or observation (such as official offending data), and the reliability of these different sources should be considered. It is also worth noting that the change is relative to what would have happened without the intervention; ‘no change’ could therefore reflect an impact if the outcome would have got worse without the intervention.
Time and sequencing – within the Rubin causal model cause precedes effect, which means it is useful to use the data gathered to assess whether each step in the causal chain happens before the following one, for example that a young person’s behaviour changes after they have attended a sports intervention, not before. If only motivated young people were attracted to the intervention, their behaviour could already be changing before attending, which would indicate that the intervention was not necessarily having an impact. However, there are instances where this is less clear cut: if a young person changes their behaviour in anticipation of starting the intervention, but would not have done so without its motivation, the change is still causally related to the intervention. This is something in-depth qualitative research can help tease out.
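As a trivial illustration of the kind of sequencing check this implies (all dates and step names invented), the ordering of each hypothesised step can be tested directly against the data gathered:

```python
# Checking that each step in a hypothesised causal chain is observed
# before the next one. All dates and labels are invented.
from datetime import date

causal_chain = [
    ("joined sports intervention", date(2023, 1, 10)),
    ("attendance became regular",  date(2023, 2, 20)),
    ("behaviour change observed",  date(2023, 4, 5)),
]

for (earlier, t1), (later, t2) in zip(causal_chain, causal_chain[1:]):
    status = "in order" if t1 <= t2 else "OUT OF ORDER"
    print(f"{earlier} -> {later}: {status}")
```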
Attribution – this is the central and often most difficult aspect of impact evaluation, and the most challenging for theory of change approaches to evidence. It relies on a thorough exploration of attribution during the process of gathering data, for example through careful probing in qualitative interviews or precisely worded survey questions.
Magnitude of effect – an important aim of counterfactual evaluation is to estimate the magnitude of an effect, not just whether there is any impact: for example, being able to say that one year of additional schooling leads to an extra £5,000 of salary at age 30, rather than just that it increases people’s salaries. Theory of change evaluations cannot provide precise numerical estimates of effect sizes but can still explore how big a difference an intervention seems to have made. For example, a young person involved in an intervention could say that they feel it has completely changed their life in terms of what they do and how they feel about themselves, or they could say it keeps them out of trouble a bit but has not made much difference. This cannot be translated into formal ‘effect sizes’, but the range of the perceived size of impacts can be described.
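For comparison, the formal ‘effect size’ a counterfactual study reports is typically a standardised mean difference of the following form (a standard textbook definition, not tied to any study cited here):

```latex
% Cohen's d: the difference in mean outcomes between treatment and
% control groups, scaled by the pooled standard deviation.
d = \frac{\bar{Y}_{\text{treatment}} - \bar{Y}_{\text{control}}}{s_{\text{pooled}}}
```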
Heterogeneity – counterfactual impact evaluations start by looking at average impacts but can undertake subgroup analysis where sample sizes allow. One advantage of theory of change evaluations is that they often have a strong focus on the range and diversity of experiences, and this can help provide insight into how the effect of the intervention may vary across individuals with different backgrounds and characteristics. If the data indicate that an intervention mostly engages young people who already have positive experiences of an activity, then this provides insight not only into who may benefit but also into the likelihood that the intervention is effective for the intended target population.
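A minimal sketch of the subgroup comparison described above (invented data and column names; in practice the samples would need to be far larger):

```python
# Comparing estimated effects across subgroups defined by prior
# experience of the activity. All data are invented.
import pandas as pd

df = pd.DataFrame({
    "prior_experience": ["positive"] * 4 + ["negative"] * 4,
    "treated": [1, 1, 0, 0, 1, 1, 0, 0],
    "outcome": [8, 7, 5, 4, 5, 4, 5, 4],
})

for group, sub in df.groupby("prior_experience"):
    effect = (sub.loc[sub.treated == 1, "outcome"].mean()
              - sub.loc[sub.treated == 0, "outcome"].mean())
    print(f"prior experience {group}: estimated effect = {effect:.1f}")
```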
Context – the realistic evaluation mantra ‘what works for whom in what circumstances’ (Pawson and Tilley, 1997) usefully draws attention to the influence of the environment within which an intervention operates, and this is something theory of change approaches can again explore to great effect.
Once the evidence has been assembled and assessed against each of these elements, there is the crucial question of reaching an overall conclusion about the impact of the intervention. As noted above, theory of change evaluations may not make it onto the bottom rung of traditional evidence hierarchies, so any conclusion based on them needs to reflect the limited robustness of the evidence. One way of doing this is to express the conclusion in probabilistic terms, for example concluding that an intervention is a ‘good bet’ or a ‘bad bet’. This might seem a frivolous way of presenting evidence for assessing social policy, but it in fact reflects how we all make most of our choices. The decision to go to war, to pull out into traffic or to arrange a picnic for next Saturday is not based on an RCT or a quasi-experimental evaluation, because it’s not possible to generate that kind of evidence. Instead, you gather the best evidence that is available and make a probabilistic judgement. In a policy context where counterfactual methods aren’t possible, theory of change approaches are the best way of answering the question, ‘is it a good bet or not?’.
References
Awards Analyst. 2022. How StreetSmarts mirrored the effects of drink driving in a Minecraft livestream, The Drum, [online]. Available at: https://www.thedrum.com/news/2022/12/09/how-streetsmarts-mirrored-the-effects-drink-driving-minecraft-livestream [Accessed 17 October 2023]
Caspi, A., Sugden, K., Moffitt, T.E., et al., 2003. Influence of life stress on depression: moderation by a polymorphism in the 5-HTT gene. Science, 301: 386–9.
Collier, D., 2011. Understanding Process Tracing. Political Science and Politics, 44(4): 823–30
Early Intervention Foundation, 2023. EIF Evidence Standards. [online] Available at: https://guidebook.eif.org.uk/page/eif-evidence-standards [Accessed 17 October 2023]
Gaffney, H., Jolliffe, D., and White, H., 2021. Sports Programmes: Toolkit Technical Report. [pdf] London: Youth Endowment Fund. Available at: https://youthendowmentfund.org.uk/toolkit/sports-programmes/ [Accessed 17 October 2023]
HM Treasury, 2020. Magenta Book: Central Government guidance on evaluation. [pdf] Available at: https://www.gov.uk/government/publications/the-magenta-book [Accessed 17 October 2023]
Imbens, G.W. and Rubin, D.B., 2010. The Rubin Causal Model. In: Durlauf, S.N. and Blume, L.E., eds. 2010. Microeconometrics. Basingstoke: Palgrave Macmillan, pp 229–241
Matute, H., Blanco, F., Yarritu, I., Díaz-Lago, M., Vadillo, M.A. and Barberia, I., 2015. Illusions of causality: how they bias our everyday thinking and how they could be reduced. Front. Psychol., 6:888. doi: 10.3389/fpsyg.2015.00888
Mayne, J., 2020. A brief on contribution analysis: Principles and concepts. [pdf] London: Evaluating Advocacy. Available at: https://www.evaluatingadvocacy.org/resources.php [Accessed 17 October 2023]
Pawson, R. and Tilley, N., 1997. Realistic Evaluation. London: Sage
Puttick, R., 2018. Mapping the Standards of Evidence Used in UK Social Policy. [pdf] London: Nesta. Available at: https://www.nesta.org.uk/report/mapping-standards-evidence-used-uk-social-policy/ [Accessed 17 October 2023]
Thornberry, T.P., & Krohn, M.D. 2000. The self-report method for measuring delinquency and crime. Measurement and Analysis of Crime and Justice, 4, 33-83.
Tilley, N., 2000. Realistic Evaluation: An Overview. [pdf] Available at: https://www.researchgate.net/publication/252160435_Realistic_Evaluation_An_Overview [Accessed 17 October 2023]
What Works for Local Economic Growth, 2016. Guide to Scoring Evidence Using the Maryland Scientific Methods Evidence Scale. [pdf] London: What Works for Local Economic Growth. Available at: https://whatworksgrowth.org/resource-library/guide-to-scoring-the-evidence/