Causal Incentive Allocation Study Guide

Core claim

This is a causal decision system, not better targeting.

The shift is from rules that select likely converters to a system that estimates incremental value, estimates incremental cost, and allocates incentives under economic constraints.

1. Identify incrementality

Use randomized assignment and clear outcome definitions so observed behavior can be interpreted causally.

2. Learn heterogeneous effects

Estimate user-action treatment effects separately for value and cost, then calibrate and stress-test rankings.

3. Optimize allocation

Convert predictions into user-level assignments that maximize expected incremental value under budget and efficiency guardrails.

4. Keep learning valid

Preserve exploration, log propensities, evaluate candidate policies offline, then validate online.

Interview framing: a heuristic coupon system asks who is likely to order. A causal incentive system asks where spending changes behavior enough to justify the cost.

Starting point

Why heuristic campaigns break down.

A typical pre-ML baseline is a lifecycle rule: if a user has crossed an inactivity threshold, send a fixed reactivation offer. That rule is easy to explain but structurally limited.

No counterfactual

It pays some users who would have returned without an incentive.

No depth personalization

It cannot decide whether a smaller, larger, or no offer is best for this specific user.

No portfolio intelligence

It cannot reliably translate a spend target into the best set of user-offer assignments.

Wrong objective

Predicting conversion propensity is not the same as estimating incremental lift per unit cost.

Formal problem

Estimate value lift, estimate cost lift, then optimize.

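For each eligible pair of user context \(x\) and action \(a\), estimate incremental value, where \(Y^v(a)\) is the potential value outcome under action \(a\) and \(a=0\) denotes no offer: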

\[ \tau^v(x,a) = \mathbb{E}[Y^v(a)-Y^v(0)\mid X=x] \]

Estimate incremental cost:

\[ \tau^c(x,a) = \mathbb{E}[Y^c(a)-Y^c(0)\mid X=x] \]

One useful efficiency convention is value per incremental cost:

\[ \mathrm{CPIV}(x,a)=\frac{\hat{\tau}^{v}(x,a)}{\hat{\tau}^{c}(x,a)} \]

Another convention is cost per incremental value, which is convenient when a lower score should rank higher:

\[ \widehat{CPI}(x,a)=\frac{\hat{\tau}^{c}(x,a)}{\hat{\tau}^{v}(x,a)} \]

The assignment problem is a constrained binary optimization:

\[ \max_{S_{x,a}\in\{0,1\}}\sum_{x,a}S_{x,a}\hat{\tau}^{v}(x,a) \quad\mathrm{s.t.}\quad \sum_{x,a}S_{x,a}\hat{\tau}^{c}(x,a)\le B \]

Additional constraints encode guardrails: treatment volume, marginal efficiency, eligibility, channel limits, and pacing.

Identification

Causal labels come before modeling.

If assignment follows historical targeting rules, observed outcomes confound user intent with treatment effect. Randomization and logging define what can be learned.

[Figure: identification map]

Binary treatment per campaign variant

In the simplest setting, define eligibility, randomize send versus withhold, and estimate conditional average treatment effects:

\[ \tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x] \]
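Under randomized send versus withhold, the simplest estimate of this quantity is a difference in mean outcomes within a segment. The sketch below assumes a hypothetical row layout of `(segment, treated, outcome)`; the function name and the example data are illustrative only.

```python
from collections import defaultdict

def cate_by_segment(rows):
    """Treated-minus-control mean outcome within each segment.

    rows: iterable of (segment, treated, outcome) from a randomized experiment.
    Segments missing either arm are skipped because the contrast is undefined.
    """
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])  # [sum_t, n_t, sum_c, n_c]
    for segment, treated, outcome in rows:
        acc = sums[segment]
        if treated:
            acc[0] += outcome
            acc[1] += 1
        else:
            acc[2] += outcome
            acc[3] += 1
    return {
        seg: s_t / n_t - s_c / n_c
        for seg, (s_t, n_t, s_c, n_c) in sums.items()
        if n_t > 0 and n_c > 0
    }

# Illustrative data: the offer moves lapsed users but not active ones.
rows = [
    ("lapsed", 1, 1.0), ("lapsed", 1, 0.0), ("lapsed", 0, 0.0), ("lapsed", 0, 0.0),
    ("active", 1, 1.0), ("active", 1, 1.0), ("active", 0, 1.0), ("active", 0, 1.0),
    ("new", 1, 1.0),  # treated only, so no contrast is available
]
effects = cate_by_segment(rows)
```

In this toy data the active segment converts more often but has zero lift, which is the gap between propensity and incrementality described earlier.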

Multi-action treatment

For richer in-app actions, the policy action space includes many possible treatments plus no-offer:

\[ a\in\mathcal{A}\cup\{\mathrm{no\ offer}\} \]

Each logged decision needs the probability of the action that was taken. Those propensities are mandatory for off-policy evaluation.

Identification pitfalls

Action-space explosion: many action cells thin out per-cell support and raise HTE variance.
Long-horizon labels: long outcome windows are conceptually appealing but can be too noisy to learn stable uplift rankings.
Interference: concurrent incentives can violate isolation assumptions and contaminate per-program lift.

Warm-start exploration

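Exploration can be skewed toward high-value, low-cost regions to reduce data-generation waste. Since skewed exploration changes the sample geometry, population-level estimates must be reweighted by inverse propensity, where \(\pi_i(W_i)\) is the logged probability that unit \(i\) received its observed action \(W_i\). The self-normalized form below is consistent rather than exactly unbiased in finite samples: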

\[ \hat{\bar{Y}}= \frac{\sum_i Y_i/\pi_i(W_i)} {\sum_i 1/\pi_i(W_i)} \]
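A minimal sketch of the weighted mean above, assuming each logged outcome comes with the propensity of the action actually taken. The function name and the toy numbers are hypothetical.

```python
def ipw_mean(outcomes, propensities):
    """Self-normalized inverse-propensity-weighted mean.

    propensities[i] is the logged probability of the action unit i actually
    received. All propensities must be strictly positive.
    """
    if len(outcomes) != len(propensities):
        raise ValueError("outcomes and propensities must align")
    if any(p <= 0.0 for p in propensities):
        raise ValueError("propensities must be strictly positive")
    weights = [1.0 / p for p in propensities]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)

# Two units from an over-sampled region, one from an under-sampled region.
outcomes = [1.0, 1.0, 0.0]
propensities = [0.5, 0.5, 0.25]
weighted = ipw_mean(outcomes, propensities)   # weights 2, 2, 4
naive = sum(outcomes) / len(outcomes)         # ignores the sampling skew
```

Here the naive mean is pulled toward the over-sampled region, while the weighted mean gives the under-sampled unit its population share.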

Uplift modeling

Model the effect, not the outcome.

[Figure: model decomposition]

Why an X-Learner style architecture

It handles treatment-control imbalance well.
It estimates heterogeneous effects directly.
It often outperforms simpler baselines in incentive environments where treatment effects vary sharply across users.
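The X-Learner recipe has three stages: fit an outcome model per arm, impute unit-level effects using the opposite arm's model, then blend the two effect models with a propensity weight. The sketch below shows only that structure. It assumes a toy base learner (per-segment means) and a hypothetical `(segment, treated, outcome)` row layout; with this base learner the stages collapse to a plain difference in means, so a real system would substitute regression models.

```python
from collections import defaultdict

def segment_means(pairs):
    """Toy base learner: mean target per segment, from (segment, target) pairs."""
    total, count = defaultdict(float), defaultdict(int)
    for segment, target in pairs:
        total[segment] += target
        count[segment] += 1
    return {s: total[s] / count[s] for s in total}

def x_learner(rows):
    """rows: (segment, treated, outcome) tuples. Returns {segment: CATE estimate}."""
    treated = [(s, y) for s, w, y in rows if w == 1]
    control = [(s, y) for s, w, y in rows if w == 0]
    # Stage 1: fit an outcome model on each arm separately.
    mu1, mu0 = segment_means(treated), segment_means(control)
    shared = set(mu1) & set(mu0)  # segments observed in both arms
    # Stage 2: impute unit-level effects with the opposite arm's model,
    # then fit an effect model on each set of imputed effects.
    tau1 = segment_means([(s, y - mu0[s]) for s, y in treated if s in shared])
    tau0 = segment_means([(s, mu1[s] - y) for s, y in control if s in shared])
    # Stage 3: blend with g = treated share. When treated units are rare,
    # tau1 (which leans on the well-estimated control model) gets more weight.
    n1, n0 = defaultdict(int), defaultdict(int)
    for s, _ in treated:
        n1[s] += 1
    for s, _ in control:
        n0[s] += 1
    cate = {}
    for s in shared:
        g = n1[s] / (n1[s] + n0[s])
        cate[s] = g * tau0[s] + (1.0 - g) * tau1[s]
    return cate

# Imbalanced toy data: 2 treated, 4 control in one segment.
rows = [("lapsed", 1, 1.0), ("lapsed", 1, 0.0)] + [("lapsed", 0, 0.0)] * 4
cate = x_learner(rows)
```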

Per-variant modeling

Bootstrap training data across multiple resamples.
Train an uplift learner per resample.
Aggregate CATE predictions across bootstraps.
Maintain separate heads for incremental cost and incremental value.
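The four steps above can be sketched as a bagging loop. This is an illustration under stated assumptions: `diff_in_means` stands in for whatever uplift learner is actually used, rows are hypothetical `(treated, value, cost)` tuples, and the resample count and seed are arbitrary.

```python
import random
from statistics import mean

VALUE, COST = 1, 2  # column positions of the two heads in each row

def diff_in_means(rows, head):
    """Stand-in uplift learner: treated mean minus control mean for one head."""
    treated = [r[head] for r in rows if r[0] == 1]
    control = [r[head] for r in rows if r[0] == 0]
    if not treated or not control:
        return None  # this resample cannot identify an effect
    return mean(treated) - mean(control)

def bagged_uplift(rows, head, n_resamples=50, seed=0):
    """Average the learner's estimate over bootstrap resamples of the rows."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(rows) for _ in rows]  # sample with replacement
        estimate = diff_in_means(resample, head)
        if estimate is not None:
            estimates.append(estimate)
    if not estimates:
        raise ValueError("no resample contained both arms")
    return mean(estimates)

# Toy data with no within-arm noise, so each head has a known lift.
rows = [(1, 2.0, 0.5)] * 10 + [(0, 1.0, 0.0)] * 10
value_lift = bagged_uplift(rows, VALUE)
cost_lift = bagged_uplift(rows, COST)
```

Keeping the heads as separate calls mirrors the separate value and cost models; the spread of the per-resample estimates could also be returned as a stability diagnostic.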

Feature strategy

Feature sets should respect lifecycle regimes. Activation, habituation, and resurrection have different behavioral dynamics, so an early one-size-fits-all model can blur treatment effects.

Hyperparameter selection

Choose models by incremental calibration, ranking monotonicity in holdout bins, and stability across splits. The practical question is: does this model produce a reliable ordering for decisioning?

Why value is harder than cost

Cost labels are usually denser and easier to calibrate. Value uplift often has a lower signal-to-noise ratio, especially when outcomes are sparse or delayed. That motivates separate heads, heavier value diagnostics, and shorter-horizon value proxies for ranking.

Labels and horizons

Learning horizon and accounting horizon are different objects.

Cost label

Incremental in-period incentive cost, credit deposited, or realized redemption depending on the channel and finance definition.

Value label

Short-horizon proxies for stable ranking, plus longer-horizon business metrics for reporting and planning.

A decision score can be expressed as a net present value proxy, where \(\lambda\) is the weight that prices a unit of incentive credit in value units:

\[ \mathrm{score}_a(x)=\widehat{\Delta V}(x,a)-\lambda\cdot\widehat{\mathrm{credit}}(x,a) \]

Training windows should match intervention dynamics. If redemption and behavioral effects are short-cycle, forcing a long noisy window into the learning objective can weaken uplift ordering.

Decision layer

Predictions become assignments through pruning, ranking, optimization, and pacing.

Pruning

Remove user-action pairs that are economically invalid before the optimizer sees them:

\[ \hat{\tau}^{v}(x,a)<0 \]

Also remove pairs whose cost-per-incremental-value exceeds guardrails.
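A minimal sketch of the two pruning rules, assuming candidates are dictionaries with hypothetical keys `tau_v` and `tau_c` and a hypothetical guardrail parameter `max_cpi`. Treating a value lift of exactly zero as prunable is an added assumption, made so the cost-per-value ratio is always defined.

```python
def prune(candidates, max_cpi):
    """Keep pairs with positive value lift whose cost per incremental value
    (tau_c / tau_v) does not exceed max_cpi."""
    kept = []
    for c in candidates:
        if c["tau_v"] <= 0.0:
            continue  # no positive incremental value (zero pruned by assumption)
        if c["tau_c"] / c["tau_v"] > max_cpi:
            continue  # too expensive per unit of incremental value
        kept.append(c)
    return kept

candidates = [
    {"user": "u1", "action": "small", "tau_v": 4.0, "tau_c": 2.0},   # ratio 0.5
    {"user": "u1", "action": "large", "tau_v": 5.0, "tau_c": 10.0},  # ratio 2.0
    {"user": "u2", "action": "small", "tau_v": -1.0, "tau_c": 1.0},  # negative lift
]
survivors = prune(candidates, max_cpi=1.0)
```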

Ranking

Rank survivors by incremental efficiency. Cost-only ordering is insufficient; the ranking must be value-aware.

Optimization

Maximize predicted incremental value.
Obey budget from the pacer.
Obey marginal efficiency and eligibility guardrails.

Useful phrasing: the business chooses spend; the system chooses allocation.
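One simple way to approximate the constrained problem is a greedy pass in order of value per unit cost. The sketch below is a heuristic, not an exact solver for the binary program. It assumes already-pruned candidates with positive `tau_v` and `tau_c`, and it adds an assumption that each user receives at most one action; the dictionary keys are hypothetical.

```python
def greedy_allocate(candidates, budget):
    """Pick user-action pairs by descending tau_v / tau_c until budget runs out.

    Returns (chosen_pairs, predicted_spend). Assumes tau_c > 0 for every pair.
    """
    ranked = sorted(candidates, key=lambda c: c["tau_v"] / c["tau_c"], reverse=True)
    chosen, spent, served = [], 0.0, set()
    for c in ranked:
        if c["user"] in served:
            continue  # assumption: at most one action per user
        if spent + c["tau_c"] > budget:
            continue  # would exceed budget; cheaper pairs may still fit
        chosen.append(c)
        spent += c["tau_c"]
        served.add(c["user"])
    return chosen, spent

candidates = [
    {"user": "u1", "action": "small", "tau_v": 4.0, "tau_c": 2.0},  # ratio 2.0
    {"user": "u1", "action": "large", "tau_v": 5.0, "tau_c": 5.0},  # ratio 1.0
    {"user": "u2", "action": "small", "tau_v": 3.0, "tau_c": 1.0},  # ratio 3.0
    {"user": "u3", "action": "small", "tau_v": 2.0, "tau_c": 4.0},  # ratio 0.5
]
chosen, spent = greedy_allocate(candidates, budget=4.0)
```

With a budget of 4.0 the pass funds u2 and then u1 with the small offer, and skips u3 because the remaining budget cannot cover it. An integer-programming solver would be the natural replacement when guardrails interact.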

Budget pacing

Models estimate liability, while finance controls realized in-period spend. The pacer adjusts model output using timing calibration and bias calibration:

\[ B_{\mathrm{today}}= B_{\mathrm{period}}- (\mathrm{actual\ spend\ to\ date}+\mathrm{expected\ remaining\ liability}) \]
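The pacing identity above is simple arithmetic. In this sketch the function and parameter names are hypothetical, and flooring the result at zero is an added assumption so that an overcommitted period never yields a negative budget.

```python
def budget_today(period_budget, spend_to_date, expected_remaining_liability):
    """Budget still available after realized spend and expected future
    redemptions of incentives that were already issued."""
    committed = spend_to_date + expected_remaining_liability
    return max(0.0, period_budget - committed)

# 60 already spent and 25 of issued credit still expected to redeem.
available = budget_today(100.0, 60.0, 25.0)
# Overcommitted period: nothing left to allocate.
overcommitted = budget_today(100.0, 90.0, 20.0)
```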

[Figure: budget pacing concept]

Off-policy evaluation

Evaluate candidate policies before shipping them.

For each user, score actions including no-offer:

\[ a^*(x)=\arg\max_a\widehat{\mathrm{score}}_a(x) \]

Then apply cutoffs to control volume and efficiency. Before online deployment, estimate candidate policy value on logged randomized data.
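A minimal sketch of the per-user action choice, assuming a score of the form predicted value lift minus a cost weight times predicted credit, with no-offer scored at zero. The names `best_action`, `lam`, and the offer labels are hypothetical, and volume or efficiency cutoffs are left out.

```python
NO_OFFER = "no_offer"

def best_action(delta_value, credit, lam):
    """Return (action, score) with the highest net score for one user.

    delta_value and credit map each action to predicted value lift and
    predicted credit. No-offer is inserted first, so it wins exact ties.
    """
    scores = {NO_OFFER: 0.0}
    for action, lift in delta_value.items():
        scores[action] = lift - lam * credit[action]
    best = max(scores, key=scores.get)
    return best, scores[best]

delta_value = {"small": 3.0, "large": 4.0}
credit = {"small": 5.0, "large": 10.0}
cheap_budget = best_action(delta_value, credit, lam=0.5)   # small scores 0.5
tight_budget = best_action(delta_value, credit, lam=1.0)   # every offer scores below 0
```

Raising the cost weight is what pushes marginal users to no-offer, which is why no-offer has to be a scored action rather than a fallback.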

Inverse propensity scoring

\[ \hat{V}_{\mathrm{IPS}}(\pi)= \frac{1}{n}\sum_i \frac{\mathbb{1}\{W_i=\pi(X_i)\}}{\hat{p}(W_i\mid X_i)} Y_i \]
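A minimal sketch of this estimator, assuming each log entry is a hypothetical tuple `(context, logged_action, propensity, outcome)` and the candidate policy is a deterministic function of context.

```python
def ips_value(logs, policy):
    """Inverse propensity scoring estimate of a deterministic policy's value.

    Only entries where the logged action matches the policy's choice
    contribute, each reweighted by 1 / propensity.
    """
    total = 0.0
    for context, action, propensity, outcome in logs:
        if policy(context) == action:
            total += outcome / propensity
    return total / len(logs)

# Uniform randomization between two actions in two contexts.
logs = [
    ("a", "offer", 0.5, 2.0),
    ("a", "none", 0.5, 1.0),
    ("b", "offer", 0.5, 0.0),
    ("b", "none", 0.5, 1.0),
]

def candidate(context):
    """Hypothetical policy: offer only in context 'a'."""
    return "offer" if context == "a" else "none"

value = ips_value(logs, candidate)
```

In this toy log the policy would earn 2.0 in context a and 1.0 in context b, and the estimate recovers their average.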

Doubly robust estimator

\[ \hat{V}_{\mathrm{DR}}(\pi)= \frac{1}{n}\sum_i \left[ \hat{\mu}(\pi(X_i),X_i) + \frac{\mathbb{1}\{W_i=\pi(X_i)\}}{\hat{p}(W_i\mid X_i)} \left(Y_i-\hat{\mu}(W_i,X_i)\right) \right] \]
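A minimal sketch of the doubly robust estimator, assuming the same hypothetical log layout `(context, logged_action, propensity, outcome)` and an outcome model passed in as a function `mu(action, context)`. The example uses a deliberately crude constant model to show the propensity correction at work.

```python
def dr_value(logs, policy, mu):
    """Doubly robust estimate of a deterministic policy's value.

    Starts from the outcome model's prediction for the policy's action and
    adds a propensity-weighted residual when the logged action matches.
    """
    total = 0.0
    for context, action, propensity, outcome in logs:
        target = policy(context)
        estimate = mu(target, context)
        if target == action:
            estimate += (outcome - mu(action, context)) / propensity
        total += estimate
    return total / len(logs)

logs = [
    ("a", "offer", 0.5, 2.0),
    ("a", "none", 0.5, 1.0),
    ("b", "offer", 0.5, 0.0),
    ("b", "none", 0.5, 1.0),
]

def candidate(context):
    """Hypothetical policy: offer only in context 'a'."""
    return "offer" if context == "a" else "none"

def crude_mu(action, context):
    """Deliberately misspecified outcome model: predicts 1.0 everywhere."""
    return 1.0

value = dr_value(logs, candidate, crude_mu)
```

Because the logged propensities are correct here, the residual term repairs the crude outcome model; with a zero outcome model the estimator reduces to IPS.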

Policy-value metrics from propensity-aware estimators are often more decision-relevant than uplift shape metrics alone, because they evaluate the actual action rule under deployment-like constraints.

Portfolio effects

Interference is structure, not noise.

When multiple incentive systems run concurrently, measured treatment effects can shift because another program also affects outcomes. Holdout interpretation becomes conditional on the current portfolio state. Per-program lift can be biased if cross-program assignment is not logged and modeled.

Log all assignment surfaces.
Include competing-treatment features where possible.
Partition audiences when stacking control is unavailable.
Move toward coordinated experimentation across systems.

Evaluation stack

What is good enough to ship?

[Figure: evaluation pipeline]

Label calibration

Bin-level predicted versus actual checks, separately for cost and value.

Ranking validity

Lower predicted cost per value should correspond to better realized efficiency; ordering should be monotonic and stable.
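The calibration and ranking checks above can share one bin-level table. The sketch below assumes holdout rows from a randomized experiment in a hypothetical layout `(predicted_uplift, treated, outcome)`, uses equal-count bins, and measures realized uplift in each bin as treated mean minus control mean.

```python
from statistics import mean

def uplift_calibration(rows, n_bins):
    """Return [(mean predicted uplift, realized uplift)] per bin, low to high.

    Bins missing either arm are skipped because realized uplift is undefined.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    size = len(ordered) // n_bins
    if size == 0:
        raise ValueError("more bins than rows")
    table = []
    for b in range(n_bins):
        start = b * size
        end = len(ordered) if b == n_bins - 1 else start + size
        chunk = ordered[start:end]
        treated = [y for _, w, y in chunk if w == 1]
        control = [y for _, w, y in chunk if w == 0]
        if not treated or not control:
            continue
        predicted = mean([p for p, _, _ in chunk])
        table.append((predicted, mean(treated) - mean(control)))
    return table

def is_monotonic(table):
    """True when realized uplift never decreases as predicted uplift rises."""
    realized = [actual for _, actual in table]
    return all(a <= b for a, b in zip(realized, realized[1:]))

rows = (
    [(0.0, 1, 1.0), (0.0, 1, 1.0), (0.0, 0, 1.0), (0.0, 0, 1.0)]    # no lift
    + [(1.0, 1, 2.0), (1.0, 1, 2.0), (1.0, 0, 1.0), (1.0, 0, 1.0)]  # lift of 1
)
table = uplift_calibration(rows, n_bins=2)
```

The same table serves both checks: the gap between the two columns is calibration, and the order of the second column is ranking validity. Running it separately for cost and value keeps the two heads' diagnostics apart.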

Optimization realism

Assignments must respect budget, pruning rules, volume limits, and marginal guardrails under edge cases.

Policy value

Estimate candidate policy value with IPS/DR and compare it against the baseline policy, using the same outcome definition the optimizer targets, before online validation.

Action-space lesson

Large action spaces fail by starving each cell of evidence.

A very large crossed action experiment can underperform even when the idea is right. With fixed total sample, each treatment cell gets thinner support, so heterogeneous treatment effect estimates become noisy and unstable.
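A short worked calculation makes the thinning concrete. As a simplifying assumption, suppose \(N\) units are split evenly across \(K\) cells and outcomes share a common standard deviation \(\sigma\) (symbols introduced here for illustration). The standard error of a single cell mean is then:

\[ \mathrm{SE}(\bar{Y}_k)=\frac{\sigma}{\sqrt{N/K}}=\sigma\sqrt{\frac{K}{N}} \]

At fixed \(N\), moving from 4 cells to 16 cells doubles the standard error of every cell mean, before any further split by user segment.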

Replacement strategy

Start with a compact, interpretable action set.
Identify strong action families.
Expand locally around promising regions.
Re-estimate the policy with updated propensities.

This preserves learning velocity while maintaining identifiability.

Lessons

Design rules for production causal allocation.

#	Lesson	Mechanism	Design rule
1	Long-horizon effects are hard to identify	Sparse, noisy incremental signal over long windows	Train on shorter-horizon proxies for ranking; keep longer horizons for reporting
2	Large action spaces fail fast	Per-cell sample support collapses	Start compact, then expand locally
3	Cost modeling is usually easier than value uplift	Denser labels and lower variance	Separate model heads and calibration workflows
4	Campaign averages hide actionable heterogeneity	Tail users can be efficient even if the campaign mean is not	Optimize at user-offer level
5	Non-compliance matters	Assigned action can differ from realized action	Use logged propensities and realized treatment for OPE
6	Interference is structure, not noise	Concurrent incentives shift outcomes	Log all surfaces and model overlap explicitly
7	Delayed redemption breaks naive budgeting	Liability timing differs from assignment timing	Add pacing with timing and bias calibration
8	Ranking quality matters more than absolute calibration for allocation	The optimizer uses ordering at the margin	Prioritize monotonic ranking diagnostics
9	AUC-like metrics can be insufficient	Shape metrics may not track policy value under logged propensities	Use IPS/DR policy-value criteria for model selection
10	Exploration policy is part of the estimator	Skewed assignment changes sample geometry	Log propensities and use IPW correction
11	One global model can be worse early	Lifecycle regimes differ	Use lifecycle-specific models until pooled evidence is strong
12	Objective mismatch creates hidden regressions	Training metric and allocation objective diverge	Align scoring objective with deployment economics
13	Static thresholds decay over time	Environment and offer mix drift	Recalibrate regularly and maintain continuous evaluation
14	No-offer is a first-class action	Forced treatment creates spend leakage	Include no-offer in the action space
15	Data generation is an optimization problem too	Random exploration can be expensive and slow	Use warm start plus correction
16	Eligibility design is itself a model	The wrong candidate pool limits every optimizer	Iterate eligibility boundaries as part of policy design
17	Counterfactual logging quality sets the ceiling	Missing or inconsistent logs break OPE	Treat logging schema as core ML infrastructure
18	Feature parity matters across train and serve	Offline-online mismatch degrades policy	Enforce feature contracts and monitoring

One-page methodology

The production loop.

Identify -> Learn -> Decide -> Validate

The methodological contribution is not a single model class. It is the integration of causal identification, uplift estimation, propensity-aware policy evaluation, and constrained optimization with pacing. That integration turns incentives from campaign heuristics into an adaptive control system.

Appendix

Equation block.

Contents: CATE, CPIV, CPI, constrained optimization, IPS, DR.

\[ \tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x] \]

\[ \mathrm{CPIV}(x,a)=\frac{\hat{\tau}^{v}(x,a)}{\hat{\tau}^{c}(x,a)}, \qquad \widehat{CPI}(x,a)=\frac{\hat{\tau}^{c}(x,a)}{\hat{\tau}^{v}(x,a)} \]

\[ \max_{\{S_t(x,a)\in\{0,1\}\}} \sum_{x,a}S_t(x,a)\hat{\tau}^{v}(x,a) \quad\mathrm{s.t.}\quad \sum_{x,a}S_t(x,a)\hat{\tau}^{c}(x,a)\le B_t \]

\[ \hat{V}_{\mathrm{IPS}}(\pi)= \frac{1}{n}\sum_i \frac{\mathbb{1}\{W_i=\pi(X_i)\}}{\hat{p}(W_i\mid X_i)} Y_i \]

\[ \hat{V}_{\mathrm{DR}}(\pi)= \frac{1}{n}\sum_i \left[ \hat{\mu}(\pi(X_i),X_i) + \frac{\mathbb{1}\{W_i=\pi(X_i)\}}{\hat{p}(W_i\mid X_i)} \left(Y_i-\hat{\mu}(W_i,X_i)\right) \right] \]