External Control Arm for an Early-Phase Oncology Trial

A reproducible RWE walkthrough using propensity score methods

Author: RWE Demo

Published: May 13, 2026

1 Background and objective

In early clinical development (Phase I / II), a single-arm trial design is often used to accelerate decision-making in oncology. However, without a randomized comparator it is difficult to attribute observed outcomes to the investigational treatment rather than to patient selection or natural disease history. External control arms (ECA) constructed from real-world data (RWD) — such as electronic health records (EHR), claims, or registries — provide a way to contextualize single-arm trial results (1,2).

This report walks through, end to end, how to build and analyze an ECA using propensity score matching and inverse probability of treatment weighting (IPTW), with all code and data fully reproducible.

Note: Learning objectives

By the end of this report you should be able to:

  1. Recognize the sources of confounding when comparing a trial cohort to RWD.
  2. Estimate a propensity score, match patients, and compute IPT weights.
  3. Diagnose covariate balance before and after adjustment.
  4. Estimate and interpret a treatment effect (HR) on a time-to-event endpoint using Cox proportional hazards.
  5. Produce a regulator-style report that is fully reproducible.

2 Data: a synthetic but realistic RWD cohort

For teaching purposes we generate a synthetic dataset whose structure mirrors an oncology RWD source. The trial cohort is drawn from the same underlying population as the external controls but is preferentially enriched for younger, fitter patients (lower ECOG, more favorable biomarker, fewer prior lines). This is the kind of selection bias that typically arises when trial participants are compared with patients treated in routine care.

Code
cohort <- simulate_rwd(n_trial = 80, n_external = 800, seed = 42)
glimpse(cohort)
Rows: 880
Columns: 9
$ patient_id  <chr> "P00001", "P00002", "P00003", "P00004", "P00005", "P00006"…
$ age         <dbl> 78.7, 59.4, 68.6, 71.3, 69.0, 63.9, 80.1, 64.1, 85.2, 64.4…
$ sex         <fct> F, F, M, F, M, M, F, F, F, M, M, M, M, M, M, F, F, F, M, F…
$ ecog        <fct> 1, 1, 2, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 2, 2, 2, 1, 0, 0, 2…
$ biomarker   <dbl> 1.621, -0.687, 0.657, 1.111, 2.141, -0.831, 0.076, 0.936, …
$ prior_lines <int> 1, 1, 1, 1, 0, 0, 0, 1, 3, 3, 1, 4, 0, 0, 1, 2, 0, 3, 2, 2…
$ treatment   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_months <dbl> 5.11, 10.48, 0.27, 13.39, 13.12, 36.00, 13.74, 15.90, 2.73…
$ event       <int> 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1…
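
The internals of `simulate_rwd()` are not shown in this report. As a rough illustration of the design described above, the following sketch draws a source population, selects trial patients with a probability that favors younger, fitter patients, and simulates exponential survival times. The function name `simulate_sketch`, every coefficient, the pool size, and the 36-month administrative censoring are assumptions made for this example; only the arm sizes, the seed, and the generative HR of 0.70 come from the report.

```r
simulate_sketch <- function(n_trial = 80, n_external = 800, seed = 42) {
  set.seed(seed)
  n_pool <- 20 * (n_trial + n_external)  # hypothetical source population size
  pool <- data.frame(
    age         = rnorm(n_pool, mean = 65, sd = 10),
    ecog        = sample(0:2, n_pool, replace = TRUE, prob = c(0.35, 0.45, 0.20)),
    biomarker   = rnorm(n_pool),
    prior_lines = rpois(n_pool, lambda = 1.2)
  )
  # Assumed selection model: younger age, lower ECOG, higher biomarker and
  # fewer prior lines all raise the chance of being enrolled in the trial.
  lp_select <- with(pool, -0.05 * (age - 65) - 1.0 * ecog +
                          0.8 * biomarker - 0.4 * prior_lines)
  trial_idx <- sample(n_pool, n_trial, prob = exp(lp_select))
  ext_idx   <- sample(setdiff(seq_len(n_pool), trial_idx), n_external)
  d <- rbind(cbind(pool[ext_idx, ],   treatment = 0L),
             cbind(pool[trial_idx, ], treatment = 1L))
  # Assumed outcome model: exponential hazard with a treatment HR of 0.70.
  rate <- 0.05 * exp(0.02 * (d$age - 65) + 0.4 * d$ecog - 0.3 * d$biomarker +
                     0.15 * d$prior_lines + log(0.70) * d$treatment)
  t_event <- rexp(nrow(d), rate = rate)
  d$time_months <- pmin(t_event, 36)           # administrative censoring
  d$event       <- as.integer(t_event <= 36)
  rownames(d) <- NULL
  d
}

sim <- simulate_sketch()
table(sim$treatment)
tapply(sim$ecog, sim$treatment, mean)  # trial arm should look fitter
```

Because the same covariates drive both selection and survival, a naive comparison of the two arms in `sim` is confounded by construction.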

3 Baseline characteristics

Code
cohort |>
  mutate(arm = factor(treatment, levels = c(0, 1),
                      labels = c("RWD control", "Trial"))) |>
  select(arm, age, sex, ecog, biomarker, prior_lines) |>
  tbl_summary(
    by = arm,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    )
  ) |>
  add_p() |>
  add_overall() |>
  modify_caption("**Table 1. Baseline characteristics, unadjusted.**")
Table 1. Baseline characteristics, unadjusted.

Characteristic    Overall       RWD control   Trial         p-value²
                  N = 880¹      N = 800¹      N = 80¹
age               64 (10)       65 (10)       58 (9)        <0.001
sex                                                         0.7
    F             443 (50%)     401 (50%)     42 (53%)
    M             437 (50%)     399 (50%)     38 (48%)
ecog                                                        <0.001
    0             355 (40%)     287 (36%)     68 (85%)
    1             391 (44%)     379 (47%)     12 (15%)
    2             134 (15%)     134 (17%)     0 (0%)
biomarker         0.03 (1.00)   -0.06 (0.96)  0.90 (0.89)   <0.001
prior_lines
    0             265 (30%)     224 (28%)     41 (51%)
    1             333 (38%)     304 (38%)     29 (36%)
    2             167 (19%)     160 (20%)     7 (8.8%)
    3             78 (8.9%)     75 (9.4%)     3 (3.8%)
    4             30 (3.4%)     30 (3.8%)     0 (0%)
    5             6 (0.7%)      6 (0.8%)      0 (0%)
    6             1 (0.1%)      1 (0.1%)      0 (0%)

¹ Mean (SD); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test; NA (no p-value reported for prior_lines)

The trial arm is systematically different from the RWD pool. A naive comparison of survival between the two arms would attribute part of this baseline difference to the treatment. We need to adjust for confounders.

4 Propensity score estimation

A propensity score is the probability of being in the trial arm given baseline covariates (3). Conditioning on the propensity score balances the measured covariates between arms in expectation; it does not address confounders that were not measured.
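
The report estimates the propensity score inside `compute_iptw()`, whose internals are not shown. A common choice for this step is a logistic regression of arm membership on the baseline covariates. The sketch below illustrates that choice with base R `glm()` on a small simulated data frame; the data frame `toy` and its coefficients are assumptions made for this example.

```r
set.seed(1)
n <- 500
toy <- data.frame(
  age  = rnorm(n, 65, 10),
  ecog = factor(sample(0:2, n, replace = TRUE))
)
# Assumed truth: younger patients with lower ECOG are more likely to be in the trial.
ecog_num <- as.integer(as.character(toy$ecog))
toy$treatment <- rbinom(n, 1, plogis(-1 - 0.06 * (toy$age - 65) - 0.8 * ecog_num))

# Logistic regression of arm membership on baseline covariates
ps_model <- glm(treatment ~ age + ecog, data = toy, family = binomial())
toy$ps       <- predict(ps_model, type = "response")  # estimated P(trial | covariates)
toy$logit_ps <- predict(ps_model, type = "link")      # linear predictor (logit scale)

tapply(toy$ps, toy$treatment, summary)
```

The logit-scale score is kept alongside the probability because calipers for matching are usually defined on that scale.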

Code
weighted <- compute_iptw(cohort)
ggplot(weighted, aes(x = ps, fill = factor(treatment))) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("0" = "#4C78A8", "1" = "#F58518"),
                    labels = c("RWD control", "Trial"),
                    name = NULL) +
  labs(title = "Propensity score distribution by arm",
       x = "Estimated propensity score", y = "Density") +
  theme(legend.position = "top")

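The two distributions overlap, which supports the positivity assumption needed for valid causal inference: every type of patient must have a non-zero probability of appearing in either arm. In regions without overlap we cannot construct a credible ECA. Table 1 already shows one such region: no trial patient has ECOG 2, so the 134 RWD patients with ECOG 2 have no trial counterparts.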
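
A density plot is a visual check; a numeric summary of the overlap can complement it. The sketch below computes the region of common support as the intersection of the two arms' propensity score ranges and reports the share of each arm falling outside it. The propensity scores are drawn from Beta distributions purely for illustration and are not the report's estimates.

```r
set.seed(2)
# Hypothetical propensity scores: controls skew low, trial patients skew higher
ps        <- c(rbeta(800, 1.5, 12), rbeta(80, 3, 8))
treatment <- rep(c(0L, 1L), times = c(800, 80))

# Common support: from the larger of the two minima to the smaller of the two maxima
lo <- max(tapply(ps, treatment, min))
hi <- min(tapply(ps, treatment, max))
in_support <- ps >= lo & ps <= hi

c(lower = lo, upper = hi)
# Share of each arm outside the common-support region
tapply(!in_support, treatment, mean)
```

Range-based support is a coarse rule: it flags only the tails and says nothing about sparse regions inside the range, so it does not replace inspecting the full distributions.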

5 Approach 1: Propensity score matching (3:1)

We match each trial patient to up to 3 RWD controls within a caliper of 0.2 SD on the logit-PS scale.

Code
matched <- match_eca(cohort, ratio = 3, caliper = 0.2)
table(matched$treatment)

  0   1 
121  55 
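
To make the matching rule concrete, here is a minimal base R sketch of greedy nearest-neighbor matching without replacement on the logit of the propensity score. It is not the implementation behind `match_eca()`, which is not shown in this report. The function name `greedy_match`, the processing of trial patients in data order, and the use of the pooled SD for the caliper are assumptions of this example.

```r
greedy_match <- function(logit_ps, treatment, ratio = 3, caliper = 0.2) {
  width     <- caliper * sd(logit_ps)      # caliper width in logit-PS units
  available <- which(treatment == 0)       # controls not yet used
  pairs     <- list()
  for (i in which(treatment == 1)) {
    d    <- abs(logit_ps[available] - logit_ps[i])
    cand <- head(order(d), ratio)          # up to `ratio` closest controls
    cand <- cand[d[cand] <= width]         # keep only those inside the caliper
    if (length(cand) > 0) {
      pairs[[length(pairs) + 1]] <-
        data.frame(treated = i, control = available[cand])
      available <- available[-cand]        # matching without replacement
    }
  }
  do.call(rbind, pairs)
}

set.seed(3)
treatment <- rep(c(0L, 1L), times = c(400, 40))
logit_ps  <- c(rnorm(400, -2.5, 1), rnorm(40, -1, 1))  # hypothetical scores
m <- greedy_match(logit_ps, treatment)

length(unique(m$treated))  # trial patients that found at least one control
nrow(m)                    # matched controls
```

Trial patients with no control inside the caliper drop out of the matched set, which is how a matched cohort can end up with fewer trial patients than the original arm, as in the 55 of 80 retained above.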

5.1 Balance diagnostics

Code
balance_tbl <- function(d) {
  d |>
    mutate(arm = factor(treatment, levels = c(0,1),
                        labels = c("RWD control", "Trial"))) |>
    select(arm, age, sex, ecog, biomarker, prior_lines) |>
    tbl_summary(by = arm,
                statistic = list(all_continuous() ~ "{mean} ({sd})"))
}
balance_tbl(matched) |>
  modify_caption("**Table 2. Baseline characteristics after matching.**")
Table 2. Baseline characteristics after matching.

Characteristic    RWD control   Trial
                  N = 121¹      N = 55¹
age               61 (8)        60 (8)
sex
    F             62 (51%)      30 (55%)
    M             59 (49%)      25 (45%)
ecog
    0             97 (80%)      45 (82%)
    1             24 (20%)      10 (18%)
    2             0 (0%)        0 (0%)
biomarker         0.58 (0.81)   0.72 (0.84)
prior_lines
    0             48 (40%)      23 (42%)
    1             50 (41%)      23 (42%)
    2             19 (16%)      6 (11%)
    3             4 (3.3%)      3 (5.5%)

¹ Mean (SD); n (%)

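After matching, mean age, the ECOG distribution, and biomarker level are much closer between arms. Only 55 of the 80 trial patients found a control within the caliper, so the matched analysis describes that subset rather than the full trial cohort.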
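
A standardized mean difference (SMD) puts "closer" on a common scale. The sketch below computes it for the two continuous covariates directly from the rounded means and SDs printed in Tables 1 and 2, so the results are approximate. The helper name `smd` and the 0.1 threshold (a widely used rule of thumb) are not taken from the report.

```r
# Standardized mean difference from group summaries (trial minus control),
# using the average of the two group variances as the scale.
smd <- function(mean_trt, sd_trt, mean_ctl, sd_ctl) {
  (mean_trt - mean_ctl) / sqrt((sd_trt^2 + sd_ctl^2) / 2)
}

balance <- data.frame(
  covariate  = c("age", "biomarker"),
  unadjusted = c(smd(58, 9, 65, 10), smd(0.90, 0.89, -0.06, 0.96)),  # Table 1
  matched    = c(smd(60, 8, 61, 8),  smd(0.72, 0.84, 0.58, 0.81))    # Table 2
)
balance$matched_ok <- abs(balance$matched) < 0.1
balance
```

With these rounded inputs the absolute SMD falls from about 0.74 to 0.125 for age and from about 1.04 to about 0.17 for biomarker. Both are large improvements, but neither clears the 0.1 threshold, which suggests some residual imbalance worth examining with unrounded data.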

6 Approach 2: IPTW (stabilized weights)

Code
summary(weighted$siptw)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09202 0.90911 0.91117 0.93476 0.93615 4.11451 

Extreme weights inflate variance. Truncating the weights at the 1st and 99th percentiles is a common remedy; it reduces variance at the cost of some bias.

Code
q <- quantile(weighted$siptw, c(0.01, 0.99))
weighted$siptw_trunc <- pmin(pmax(weighted$siptw, q[1]), q[2])
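
The weights themselves come from `compute_iptw()`, whose internals are not shown. One common definition of stabilized weights divides the marginal probability of the observed arm by its conditional probability given covariates. The summary above is consistent with that form, since 1 - 80/880 ≈ 0.909 would be the smallest possible control weight, but this is an inference rather than a confirmed description of the function. The sketch below applies that definition to hypothetical propensity scores and adds the Kish effective sample size as a measure of how much information the weighting costs.

```r
set.seed(4)
# Hypothetical propensity scores, not the report's estimates
ps        <- c(rbeta(800, 1.5, 12), rbeta(80, 3, 8))
treatment <- rep(c(0L, 1L), times = c(800, 80))

p_trt <- mean(treatment)  # marginal P(trial)
# Stabilized weights: P(observed arm) / P(observed arm | covariates)
sw <- ifelse(treatment == 1, p_trt / ps, (1 - p_trt) / (1 - ps))

# Truncate at the 1st and 99th percentiles, as in the report
q        <- quantile(sw, c(0.01, 0.99))
sw_trunc <- pmin(pmax(sw, q[1]), q[2])

# Kish effective sample size: (sum of weights)^2 / sum of squared weights
ess <- function(w) sum(w)^2 / sum(w^2)
tapply(sw_trunc, treatment, ess)
```

An effective sample size far below the nominal arm size signals that a few patients dominate the weighted analysis.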

7 Treatment effect on overall survival

7.1 Kaplan–Meier on the matched cohort

Code
fit_km <- survfit(Surv(time_months, event) ~ treatment, data = matched)
ggsurvplot(fit_km, data = matched, risk.table = TRUE,
           conf.int = TRUE, palette = c("#4C78A8", "#F58518"),
           legend.labs = c("RWD control (matched)", "Trial"),
           xlab = "Months since index", ylab = "Overall survival")

7.2 Cox proportional hazards — three estimators

Code
hr_naive   <- fit_cox(cohort)              |> mutate(method = "Unadjusted")
hr_matched <- fit_cox(matched)             |> mutate(method = "PS matching (3:1)")
hr_iptw    <- fit_cox(weighted, weights = weighted$siptw_trunc) |>
              mutate(method = "Stabilized IPTW")

bind_rows(hr_naive, hr_matched, hr_iptw) |>
  filter(term == "treatment") |>
  transmute(method,
            HR        = round(estimate, 2),
            `95% CI`  = sprintf("%.2f – %.2f", conf.low, conf.high),
            `p-value` = signif(p.value, 3)) |>
  knitr::kable(caption = "Estimated treatment effect on OS by method.")
Estimated treatment effect on OS by method.

method               HR     95% CI         p-value
Unadjusted           0.39   0.28 – 0.55    <0.00001
PS matching (3:1)    0.74   0.47 – 1.16    0.19100
Stabilized IPTW      0.38   0.21 – 0.68    0.00121

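The unadjusted HR (0.39) is biased downward, that is, artificially favorable to treatment, because the trial arm starts healthier. After PS matching the estimate (0.74) is close to the true generative HR of 0.70, although its 95% CI (0.47 to 1.16) includes 1 and only the 55 matched trial patients contribute. The truncated stabilized IPTW estimate (0.38) is essentially unchanged from the unadjusted one, so in this run the weighting did not move the estimate toward the generative value; covariate balance in the weighted sample should be checked before this estimate is interpreted.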
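
The helper `fit_cox()` is not shown in this report. For readers who want to see what a weighted Cox fit can look like, the sketch below uses `survival::coxph()` with a `weights` argument and a robust (sandwich) variance on simulated data with a single confounder. The data-generating values and all object names are assumptions of this example, and no particular numeric result is claimed.

```r
library(survival)

set.seed(5)
n <- 600
d <- data.frame(x = rnorm(n))                        # one hypothetical confounder
d$treatment <- rbinom(n, 1, plogis(-1 + 1.0 * d$x))  # x raises P(trial)
# Assumed truth: x lowers the hazard; conditional treatment HR = 0.70
t_event <- rexp(n, rate = 0.05 * exp(-0.7 * d$x + log(0.70) * d$treatment))
d$time_months <- pmin(t_event, 36)
d$event       <- as.integer(t_event <= 36)

# Stabilized weights from a logistic propensity model
ps   <- fitted(glm(treatment ~ x, data = d, family = binomial()))
p    <- mean(d$treatment)
d$sw <- ifelse(d$treatment == 1, p / ps, (1 - p) / (1 - ps))

fit_naive <- coxph(Surv(time_months, event) ~ treatment, data = d)
# robust = TRUE requests a sandwich variance, advisable when weights are used
fit_iptw  <- coxph(Surv(time_months, event) ~ treatment, data = d,
                   weights = sw, robust = TRUE)

hr_table <- rbind(
  naive = c(exp(coef(fit_naive)), exp(confint(fit_naive))),
  iptw  = c(exp(coef(fit_iptw)),  exp(confint(fit_iptw)))
)
colnames(hr_table) <- c("HR", "lower", "upper")
round(hr_table, 2)
```

A weighted Cox model estimates a marginal hazard ratio, which need not equal the conditional HR used to generate the data, so an exact match to 0.70 should not be expected even when confounding is fully removed.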

8 Sensitivity considerations

For a regulatory submission you would additionally:

  • pre-specify the analysis plan and lock the protocol;
  • assess unmeasured confounding via the E-value (4);
  • run quantitative bias analyses;
  • repeat with alternate matching ratios, calipers, and weight truncation thresholds;
  • assess the proportional hazards assumption (Schoenfeld residuals);
  • consider doubly-robust estimators (e.g., AIPW, TMLE).
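
As an illustration of the E-value item in the list above, the sketch below implements the formula from VanderWeele and Ding (4) for a hazard ratio. The function name `e_value_hr` and its `rare` argument are choices made for this example. When the outcome is common, the HR is first converted to an approximate risk ratio using the transformation proposed in that paper; when it is rare, the HR is used as the risk ratio directly.

```r
# E-value for a hazard ratio point estimate.
# rare = TRUE  : treat the HR as an approximate risk ratio.
# rare = FALSE : apply the common-outcome approximation first.
e_value_hr <- function(hr, rare = FALSE) {
  rr <- if (rare) hr else (1 - 0.5^sqrt(hr)) / (1 - 0.5^sqrt(1 / hr))
  if (rr < 1) rr <- 1 / rr   # work on the scale where RR >= 1
  rr + sqrt(rr * (rr - 1))
}

e_value_hr(0.74, rare = TRUE)  # matched HR, rare-outcome reading
e_value_hr(0.74)               # matched HR, common-outcome approximation
```

Under the rare-outcome reading, the matched HR of 0.74 gives an E-value of about 2.04. Death is usually not a rare outcome in oncology cohorts, so the common-outcome version is likely the more appropriate one here, and it gives a smaller value. Because the matched 95% CI includes 1, the E-value for the confidence limit is 1.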

9 Reproducibility

Code
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] broom_1.0.12    gtsummary_2.5.0 survminer_0.5.2 ggpubr_0.6.3   
[5] survival_3.6-4  ggplot2_4.0.3   dplyr_1.2.1     ecasim_0.1.0   

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       xfun_0.57          htmlwidgets_1.6.4  MatchIt_4.7.2     
 [5] rstatix_0.7.3      lattice_0.22-6     vctrs_0.7.3        tools_4.4.1       
 [9] generics_0.1.4     tibble_3.3.1       pkgconfig_2.0.3    Matrix_1.7-0      
[13] RColorBrewer_1.1-3 S7_0.2.2           gt_1.3.0           lifecycle_1.0.5   
[17] compiler_4.4.1     farver_2.1.2       stringr_1.6.0      carData_3.0-6     
[21] litedown_0.9       htmltools_0.5.9    sass_0.4.10        yaml_2.3.12       
[25] Formula_1.2-5      pillar_1.11.1      car_3.1-5          tidyr_1.3.2       
[29] abind_1.4-8        commonmark_2.0.0   tidyselect_1.2.1   digest_0.6.39     
[33] stringi_1.8.7      purrr_1.2.2        labeling_0.4.3     splines_4.4.1     
[37] fastmap_1.2.0      grid_4.4.1         cli_3.6.6          magrittr_2.0.5    
[41] base64enc_0.1-6    cards_0.7.1        withr_3.0.2        scales_1.4.0      
[45] backports_1.5.1    cardx_0.3.2        rmarkdown_2.31     otel_0.2.0        
[49] ggtext_0.1.2       gridExtra_2.3      ggsignif_0.6.4     chk_0.10.0        
[53] evaluate_1.0.5     knitr_1.51         markdown_2.0       rlang_1.2.0       
[57] gridtext_0.1.6     Rcpp_1.1.1-1.1     glue_1.8.1         xml2_1.5.2        
[61] jsonlite_2.0.0     R6_2.6.1           fs_2.1.0          

The exact R version and package versions are pinned via the .devcontainer and the package DESCRIPTION. Every figure and table above is regenerated from the same seed (42) each time the report is rendered, so two reviewers rendering it in the pinned environment on different machines should get the same numbers.

References

1. Rahman R et al. Use of external controls in regulatory decision-making for oncology drugs. Clinical Pharmacology and Therapeutics. 2021.
2. European Medicines Agency. Real-world evidence framework to support EU regulatory decision-making [Internet]. 2023. Available from: https://www.ema.europa.eu/
3. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.
4. VanderWeele TJ, Ding P. Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine. 2017;167(4):268–74.