Core claim
This is a causal decision system, not better targeting.
The shift is from rules that select likely converters to a system that estimates incremental value, estimates incremental cost, and allocates incentives under economic constraints.
1. Identify incrementality
Use randomized assignment and clear outcome definitions so observed behavior can be interpreted causally.
2. Learn heterogeneous effects
Estimate user-action treatment effects separately for value and cost, then calibrate and stress-test rankings.
3. Optimize allocation
Convert predictions into user-level assignments that maximize expected incremental value under budget and efficiency guardrails.
4. Keep learning valid
Preserve exploration, log propensities, evaluate candidate policies offline, then validate online.
Starting point
Why heuristic campaigns break down.
A typical pre-ML baseline is a lifecycle rule: if a user has crossed an inactivity threshold, send a fixed reactivation offer. That rule is easy to explain but structurally limited.
No counterfactual
It pays some users who would have returned without an incentive.
No depth personalization
It cannot decide whether a smaller, larger, or no offer is best for this specific user.
No portfolio intelligence
It cannot reliably translate a spend target into the best set of user-offer assignments.
Wrong objective
Predicting conversion propensity is not the same as estimating incremental lift per unit cost.
Formal problem
Estimate value lift, estimate cost lift, then optimize.
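For each eligible context-user-action pair \((x,a)\), with \(a_0\) denoting no-offer, estimate incremental value:
\[ \tau_V(x,a) = \mathbb{E}[\,V(a) - V(a_0) \mid X = x\,] \]
Estimate incremental cost:
\[ \tau_C(x,a) = \mathbb{E}[\,C(a) - C(a_0) \mid X = x\,] \]
One useful efficiency convention is value per incremental cost:
\[ \rho(x,a) = \frac{\tau_V(x,a)}{\tau_C(x,a)} \]
Another convention is cost per incremental value, useful when ranking lower is better:
\[ \kappa(x,a) = \frac{\tau_C(x,a)}{\tau_V(x,a)} \]
The assignment problem is a constrained binary optimization:
\[ \max_{z} \sum_{u} \sum_{a \neq a_0} z_{u,a}\, \hat\tau_V(x_u,a) \quad \text{s.t.} \quad \sum_{u} \sum_{a \neq a_0} z_{u,a}\, \hat\tau_C(x_u,a) \le B, \quad \sum_{a \neq a_0} z_{u,a} \le 1 \ \ \forall u, \quad z_{u,a} \in \{0,1\} \]
Here \(V(a)\) and \(C(a)\) are the value and cost outcomes under action \(a\), \(B\) is the budget, and \(z_{u,a} = 1\) assigns action \(a\) to user \(u\). The symbols are one consistent choice of notation, assumed for exposition.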
Additional constraints encode guardrails: treatment volume, marginal efficiency, eligibility, channel limits, and pacing.
Identification
Causal labels come before modeling.
If assignment follows historical targeting rules, observed outcomes confound user intent with treatment effect. Randomization and logging define what can be learned.
Identification map
Binary treatment per campaign variant
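In the simplest setting, define eligibility, randomize send versus withhold, and estimate conditional average treatment effects:
\[ \tau(x) = \mathbb{E}[\,Y(1) - Y(0) \mid X = x\,] \]
where \(Y(1)\) and \(Y(0)\) are the outcomes under send and withhold.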
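As a minimal sketch of the binary case, the conditional effect can be approximated by a treated-minus-control difference in means within each segment of a randomized experiment. The segment names and the row layout are hypothetical, and a real system would use a fitted uplift model rather than raw cell means.

```python
from collections import defaultdict


def segment_cate(rows):
    """Difference in mean outcome (treated minus control) per segment.

    rows: iterable of (segment, treated, outcome) tuples from a randomized
    send-versus-withhold experiment. Segment names are hypothetical.
    """
    # Per segment: [treated sum, treated count, control sum, control count]
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for segment, treated, outcome in rows:
        cell = sums[segment]
        if treated:
            cell[0] += outcome
            cell[1] += 1
        else:
            cell[2] += outcome
            cell[3] += 1
    cate = {}
    for segment, (sum_t, n_t, sum_c, n_c) in sums.items():
        if n_t and n_c:  # a contrast needs both arms
            cate[segment] = sum_t / n_t - sum_c / n_c
    return cate


rows = [
    ("lapsed_30d", True, 1.0), ("lapsed_30d", True, 0.0),
    ("lapsed_30d", False, 0.0), ("lapsed_30d", False, 0.0),
    ("lapsed_7d", True, 1.0), ("lapsed_7d", False, 1.0),
]
effects = segment_cate(rows)
print(effects)
```

In this toy data the `lapsed_7d` users return with or without the offer, so their estimated effect is zero even though their conversion rate is high.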
Multi-action treatment
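For richer in-app actions, the policy action space includes many possible treatments plus no-offer:
\[ \mathcal{A} = \{a_0, a_1, \dots, a_K\} \]
where \(a_0\) is no-offer and \(a_1, \dots, a_K\) are the available treatments.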
Each logged decision needs the probability of the action that was taken. Those propensities are mandatory for off-policy evaluation.
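One way to make propensity logging hard to forget is to have the sampler return the probability alongside the chosen action. The sketch below assumes a hypothetical `LoggedDecision` record and hypothetical action names; it is an illustration, not a prescribed schema.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedDecision:
    user_id: str
    action: str
    propensity: float  # probability the logging policy gave to `action`


def sample_and_log(user_id, action_probs, rng):
    """Draw one action from `action_probs` and record its probability."""
    total = sum(action_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("action probabilities must sum to 1")
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    action = rng.choices(actions, weights=weights, k=1)[0]
    return LoggedDecision(user_id, action, action_probs[action])


rng = random.Random(0)
probs = {"no_offer": 0.5, "offer_small": 0.3, "offer_large": 0.2}
record = sample_and_log("u1", probs, rng)
print(record)
```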
Identification pitfalls
- Action-space explosion: many action cells thin out per-cell support and raise HTE variance.
- Long-horizon labels: conceptually appealing windows can be too noisy for learning stable uplift rankings.
- Interference: concurrent incentives can violate isolation assumptions and contaminate per-program lift.
Warm-start exploration
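Exploration can be skewed toward high-value, low-cost regions to reduce data-generation waste. Since skewed exploration changes the sample geometry, unbiased population estimates require inverse propensity weighting. For a binary treatment:
\[ \hat\tau_{\mathrm{IPW}} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{T_i\, Y_i}{e_i} - \frac{(1 - T_i)\, Y_i}{1 - e_i} \right) \]
where \(T_i\) indicates treatment, \(Y_i\) is the outcome, and \(e_i\) is the logged treatment probability for user \(i\).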
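The correction can be sketched for the binary send-versus-withhold case with a Horvitz-Thompson style estimator. The row layout and the toy numbers are assumptions chosen so that the arithmetic is easy to follow.

```python
def ipw_ate(rows):
    """Inverse-propensity-weighted estimate of the average treatment effect.

    rows: iterable of (treated, propensity, outcome), where propensity is
    the logged probability of treatment for that user, strictly in (0, 1).
    """
    n = 0
    total = 0.0
    for treated, propensity, outcome in rows:
        if not 0.0 < propensity < 1.0:
            raise ValueError("propensity must be strictly between 0 and 1")
        if treated:
            total += outcome / propensity
        else:
            total -= outcome / (1.0 - propensity)
        n += 1
    return total / n


# Two equal-sized strata; exploration is skewed toward the first one.
rows = (
    [(True, 0.8, 1.0)] * 8 + [(False, 0.8, 0.0)] * 2    # true effect 1.0
    + [(True, 0.2, 2.0)] * 2 + [(False, 0.2, 2.0)] * 8  # true effect 0.0
)
estimate = ipw_ate(rows)
print(round(estimate, 6))
```

In this example the unweighted treated-minus-control difference is negative (1.2 versus 1.6), because the treated group over-represents the low-outcome stratum, while the weighted estimate recovers the population effect of 0.5.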
Uplift modeling
Model the effect, not the outcome.
Model decomposition
Why an X-Learner style architecture
- It handles treatment-control imbalance well.
- It estimates heterogeneous effects directly.
- It often outperforms simpler baselines in incentive environments where treatment effects vary sharply across users.
Per-variant modeling
- Bootstrap training data across multiple resamples.
- Train an uplift learner per resample.
- Aggregate CATE predictions across bootstraps.
- Maintain separate heads for incremental cost and incremental value.
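The per-variant steps above can be sketched as a bootstrap loop around any uplift learner. To keep the example short, the learner here is a plain difference in means, a stand-in for an X-Learner rather than an implementation of one; the function names and sample layout are hypothetical.

```python
import random
from statistics import mean


def diff_in_means(sample):
    """Stand-in uplift learner: treated mean minus control mean, or None."""
    treated = [y for t, y in sample if t]
    control = [y for t, y in sample if not t]
    if not treated or not control:
        return None  # a resample can miss one arm entirely
    return mean(treated) - mean(control)


def bagged_uplift(sample, n_boot, rng, learner=diff_in_means):
    """Average the learner's estimate over bootstrap resamples."""
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(sample, k=len(sample))  # with replacement
        estimate = learner(resample)
        if estimate is not None:
            estimates.append(estimate)
    if not estimates:
        raise ValueError("no resample contained both arms")
    return mean(estimates)


def two_head_scores(value_sample, cost_sample, n_boot=50, seed=0):
    """Separate heads: one bagged estimate for value, one for cost."""
    rng = random.Random(seed)
    return {
        "incremental_value": bagged_uplift(value_sample, n_boot, rng),
        "incremental_cost": bagged_uplift(cost_sample, n_boot, rng),
    }


# (treated, outcome) pairs for one campaign variant; toy, noise-free data.
value_sample = [(True, 3.0)] * 20 + [(False, 1.0)] * 20
cost_sample = [(True, 1.0)] * 20 + [(False, 0.0)] * 20
scores = two_head_scores(value_sample, cost_sample)
print(scores)
```

With real, noisy labels the spread of the per-resample estimates is also worth keeping, since it indicates how stable the ranking is likely to be.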
Feature strategy
Feature sets should respect lifecycle regimes. Activation, habituation, and resurrection have different behavioral dynamics, so an early one-size-fits-all model can blur treatment effects.
Hyperparameter selection
Choose models by incremental calibration, ranking monotonicity in holdout bins, and stability across splits. The practical question is: does this model produce a reliable ordering for decisioning?
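A minimal sketch of the holdout-bin check: sort a randomized holdout by predicted uplift, cut it into equal-count bins, and confirm that realized uplift rises with the predicted ranking. The function names, the equal-count binning, and the toy data are assumptions.

```python
from statistics import mean


def realized_uplift_by_bin(rows, n_bins):
    """Realized treated-minus-control difference per predicted-uplift bin.

    rows: (predicted_uplift, treated, outcome) from a randomized holdout.
    Bins are equal-count and ordered from lowest to highest prediction.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    size = len(ordered) // n_bins
    if size == 0:
        raise ValueError("more bins than rows")
    uplifts = []
    for b in range(n_bins):
        # The last bin absorbs any remainder rows.
        end = (b + 1) * size if b < n_bins - 1 else len(ordered)
        chunk = ordered[b * size:end]
        treated = [y for _, t, y in chunk if t]
        control = [y for _, t, y in chunk if not t]
        if not treated or not control:
            raise ValueError("each bin needs both arms")
        uplifts.append(mean(treated) - mean(control))
    return uplifts


def is_monotonic(values):
    """True when values never decrease from one bin to the next."""
    return all(a <= b for a, b in zip(values, values[1:]))


holdout = (
    [(0.1, True, 1.0), (0.1, True, 1.0), (0.1, False, 1.0), (0.1, False, 1.0)]
    + [(0.5, True, 2.0), (0.5, True, 2.0), (0.5, False, 1.0), (0.5, False, 1.0)]
    + [(0.9, True, 4.0), (0.9, True, 4.0), (0.9, False, 1.0), (0.9, False, 1.0)]
)
bins = realized_uplift_by_bin(holdout, n_bins=3)
print(bins, is_monotonic(bins))
```

Running the same check on several train and holdout splits gives a rough view of stability: a model whose bin ordering flips between splits is a weak basis for allocation.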
Why value is harder than cost
Cost labels are usually denser and easier to calibrate. Value uplift often has a lower signal-to-noise ratio, especially when outcomes are sparse or delayed. That motivates separate heads, heavier value diagnostics, and shorter-horizon value proxies for ranking.
Labels and horizons
Learning horizon and accounting horizon are different objects.
Cost label
Incremental in-period incentive cost, credit deposited, or realized redemption depending on the channel and finance definition.
Value label
Short-horizon proxies for stable ranking, plus longer-horizon business metrics for reporting and planning.
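A decision score can be expressed as a net present value proxy: predicted incremental value over the chosen horizon, discounted where appropriate, net of predicted incremental cost.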
Training windows should match intervention dynamics. If redemption and behavioral effects are short-cycle, forcing a long noisy window into the learning objective can weaken uplift ordering.
Decision layer
Predictions become assignments through pruning, ranking, optimization, and pacing.
Pruning
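Remove user-action pairs that are economically invalid before the optimizer sees them, for example pairs whose predicted incremental value \(\hat\tau_V(x,a)\) is not positive:
\[ \hat\tau_V(x,a) \le 0 \]
Also remove pairs whose cost per incremental value exceeds guardrails.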
Ranking
Rank survivors by incremental efficiency. Cost-only ordering is insufficient; the ranking must be value-aware.
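The pruning and ranking steps can be sketched as a filter followed by a sort. The `Candidate` fields, the `max_cost_per_value` guardrail, and the choice to treat non-positive predicted value as economically invalid are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    user_id: str
    action: str
    value_lift: float  # predicted incremental value
    cost_lift: float   # predicted incremental cost


def prune_and_rank(candidates, max_cost_per_value):
    """Drop invalid pairs, then sort by cost per incremental value."""
    survivors = [
        c for c in candidates
        if c.value_lift > 0.0  # no predicted benefit: prune
        and c.cost_lift / c.value_lift <= max_cost_per_value  # guardrail
    ]
    # Lower cost per incremental value ranks first.
    return sorted(survivors, key=lambda c: c.cost_lift / c.value_lift)


candidates = [
    Candidate("u1", "offer_small", 4.0, 1.0),   # ratio 0.25
    Candidate("u1", "offer_large", 5.0, 3.0),   # ratio 0.60
    Candidate("u2", "offer_small", -1.0, 1.0),  # pruned: negative value
    Candidate("u3", "offer_large", 2.0, 4.0),   # pruned: ratio 2.0
    Candidate("u4", "offer_small", 2.0, 0.2),   # ratio 0.10
]
ranked = prune_and_rank(candidates, max_cost_per_value=1.0)
print([(c.user_id, c.action) for c in ranked])
```

Sorting on cost alone would place `u2` level with `u1`'s small offer even though its predicted value is negative, which is the reason the ranking needs to be value-aware.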
Optimization
- Maximize predicted incremental value.
- Obey budget from the pacer.
- Obey marginal efficiency and eligibility guardrails.
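A greedy pass over efficiency-ranked candidates is one simple way to approximate this allocation. It is a heuristic, not an exact solver, and the tuple layout and budget figure below are hypothetical.

```python
def greedy_allocate(candidates, budget):
    """Greedy heuristic: take the most efficient pairs that still fit.

    candidates: iterable of (user_id, action, value_lift, cost_lift) with
    value_lift > 0 and cost_lift > 0 (pruning is assumed to have run).
    Returns ({user_id: action}, total_value, total_cost). Users without
    an entry implicitly receive no-offer.
    """
    ordered = sorted(candidates, key=lambda c: c[2] / c[3], reverse=True)
    assignment, total_value, total_cost = {}, 0.0, 0.0
    for user_id, action, value_lift, cost_lift in ordered:
        if user_id in assignment:
            continue  # at most one action per user
        if total_cost + cost_lift > budget:
            continue  # would exceed the budget handed down by the pacer
        assignment[user_id] = action
        total_value += value_lift
        total_cost += cost_lift
    return assignment, total_value, total_cost


candidates = [
    ("u1", "offer_small", 4.0, 1.0),
    ("u1", "offer_large", 5.0, 3.0),
    ("u2", "offer_small", 3.0, 2.0),
    ("u3", "offer_large", 6.0, 4.0),
]
assignment, total_value, total_cost = greedy_allocate(candidates, budget=5.0)
print(assignment, total_value, total_cost)
```

The example also shows the heuristic's limit: greedy reaches a value of 7.0 at a cost of 3.0, while assigning `u1` the small offer and `u3` the large offer would reach 10.0 at a cost of exactly 5.0. An integer-programming or Lagrangian formulation is the usual way to close that gap.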
Budget pacing
Models estimate liability, while finance controls realized in-period spend. The pacer adjusts model output using timing calibration (when predicted liability is realized as spend) and bias calibration (how realized cost compares with predicted cost).
Budget pacing concept
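One way to picture the pacer is as a conversion from a finance spend target into a budget expressed in predicted-cost units. The multiplicative form, the parameter names, and the numbers are assumptions; an actual pacer may calibrate timing and bias quite differently.

```python
def pacer_budget(spend_target, realized_fraction, bias_ratio):
    """Translate an in-period spend target into a predicted-cost budget.

    spend_target: in-period spend that finance allows.
    realized_fraction: share of assigned liability expected to be realized
        in-period (timing calibration), in (0, 1].
    bias_ratio: historical realized cost divided by predicted cost
        (bias calibration), greater than 0.
    Assumes expected in-period spend is
        predicted_cost * realized_fraction * bias_ratio.
    """
    if not 0.0 < realized_fraction <= 1.0:
        raise ValueError("realized_fraction must be in (0, 1]")
    if bias_ratio <= 0.0:
        raise ValueError("bias_ratio must be positive")
    return spend_target / (realized_fraction * bias_ratio)


# Half of the liability lands in-period; the model under-predicts cost by 25%.
budget = pacer_budget(spend_target=1000.0, realized_fraction=0.5, bias_ratio=1.25)
print(budget)  # 1600.0
```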
Off-policy evaluation
Evaluate candidate policies before shipping them.
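For each user, score actions including no-offer and select the best-scoring one:
\[ \pi(x) = \arg\max_{a \in \mathcal{A}} s(x,a) \]
where \(s(x,a)\) is the decision score, written here with higher scores preferred, and no-offer is always among the candidates in \(\mathcal{A}\).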
Then apply cutoffs to control volume and efficiency. Before online deployment, estimate candidate policy value on logged randomized data.
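Inverse propensity scoring
\[ \hat V_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbb{1}[\pi(x_i) = a_i]}{p_i}\, r_i \]
Doubly robust estimator
\[ \hat V_{\mathrm{DR}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat r(x_i, \pi(x_i)) + \frac{\mathbb{1}[\pi(x_i) = a_i]}{p_i} \bigl( r_i - \hat r(x_i, a_i) \bigr) \right] \]
Here \((x_i, a_i, p_i, r_i)\) are the logged context, action, propensity, and outcome, \(\pi\) is a deterministic candidate policy, and \(\hat r\) is an outcome model. These are the standard forms of the two estimators.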
Policy-value metrics from propensity-aware estimators are often more decision-relevant than uplift shape metrics alone, because they evaluate the actual action rule under deployment-like constraints.
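Both estimators are short enough to sketch directly. The log layout, the toy policy, and the constant reward model below are hypothetical; the reward would normally be an objective-consistent outcome such as value net of cost.

```python
def ips_value(logs, policy):
    """Inverse propensity scoring estimate of a deterministic policy's value.

    logs: iterable of (context, action, propensity, reward).
    """
    total, n = 0.0, 0
    for context, action, propensity, reward in logs:
        if policy(context) == action:
            total += reward / propensity
        n += 1
    return total / n


def dr_value(logs, policy, reward_model):
    """Doubly robust estimate: model prediction plus weighted residual."""
    total, n = 0.0, 0
    for context, action, propensity, reward in logs:
        target = policy(context)
        total += reward_model(context, target)
        if target == action:
            total += (reward - reward_model(context, action)) / propensity
        n += 1
    return total / n


def candidate_policy(context):
    # Hypothetical rule: treat only the high-response context.
    return "offer" if context == "hi" else "no_offer"


def constant_model(context, action):
    # Deliberately crude outcome model; DR corrects it with residuals.
    return 1.0


# Uniform random logging over two actions, so every propensity is 0.5.
logs = [
    ("hi", "offer", 0.5, 3.0), ("hi", "no_offer", 0.5, 1.0),
    ("lo", "offer", 0.5, 0.0), ("lo", "no_offer", 0.5, 1.0),
]
ips = ips_value(logs, candidate_policy)
dr = dr_value(logs, candidate_policy, constant_model)
print(ips, dr)
```

On this toy log both estimators return 2.0, which matches the policy's true average reward (3.0 for `hi` users, 1.0 for `lo` users). On real logs the doubly robust form usually has lower variance when the outcome model is reasonable.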
Portfolio effects
Interference is structure, not noise.
When multiple incentive systems run concurrently, measured treatment effects can shift because another program also affects outcomes. Holdout interpretation becomes conditional on the current portfolio state. Per-program lift can be biased if cross-program assignment is not logged and modeled.
- Log all assignment surfaces.
- Include competing-treatment features where possible.
- Partition audiences when stacking control is unavailable.
- Move toward coordinated experimentation across systems.
Evaluation stack
What is good enough to ship?
Evaluation pipeline
Label calibration
Bin-level predicted versus actual checks, separately for cost and value.
Ranking validity
Lower predicted cost per value should correspond to better realized efficiency; ordering should be monotonic and stable.
Optimization realism
Assignments must respect budget, pruning rules, volume limits, and marginal guardrails under edge cases.
Policy value
Use IPS/DR against the baseline policy with objective-consistent outcomes before online validation.
Action-space lesson
Large action spaces fail by starving each cell of evidence.
A very large crossed action experiment can underperform even when the idea is right. With a fixed total sample, each treatment cell gets thinner support, so heterogeneous treatment effect estimates become noisy and unstable.
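The thinning effect can be made concrete with the standard error of a treated-minus-control mean difference. The sketch assumes an even split of users across arms and a common outcome standard deviation, both simplifications, and the sample sizes are illustrative.

```python
import math


def per_cell_standard_error(total_users, n_cells, outcome_sd):
    """Standard error of one cell's treated-minus-control mean difference.

    Assumes users are split evenly across n_cells arms (one of which is
    the control) and that outcomes share a common standard deviation.
    """
    per_arm = total_users / n_cells
    return outcome_sd * math.sqrt(2.0 / per_arm)


compact = per_cell_standard_error(100_000, n_cells=5, outcome_sd=1.0)
crossed = per_cell_standard_error(100_000, n_cells=125, outcome_sd=1.0)
print(round(compact, 4), round(crossed, 4))
```

With the same 100,000 users, moving from 5 arms to 125 arms raises the per-cell standard error from about 0.01 to about 0.05. That is before any further split by user segment, which is where heterogeneous effect estimates are actually needed.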
Replacement strategy
- Start with a compact, interpretable action set.
- Identify strong action families.
- Expand locally around promising regions.
- Re-estimate the policy with updated propensities.
This preserves learning velocity while maintaining identifiability.
Lessons
Design rules for production causal allocation.
| # | Lesson | Mechanism | Design rule |
|---|---|---|---|
| 1 | Long-horizon effects are hard to identify | Sparse, noisy incremental signal over long windows | Train on shorter-horizon proxies for ranking; keep longer horizons for reporting |
| 2 | Large action spaces fail fast | Per-cell sample support collapses | Start compact, then expand locally |
| 3 | Cost modeling is usually easier than value uplift | Denser labels and lower variance | Separate model heads and calibration workflows |
| 4 | Campaign averages hide actionable heterogeneity | Tail users can be efficient even if the campaign mean is not | Optimize at user-offer level |
| 5 | Non-compliance matters | Assigned action can differ from realized action | Use logged propensities and realized treatment for OPE |
| 6 | Interference is structure, not noise | Concurrent incentives shift outcomes | Log all surfaces and model overlap explicitly |
| 7 | Delayed redemption breaks naive budgeting | Liability timing differs from assignment timing | Add pacing with timing and bias calibration |
| 8 | Ranking quality matters more than absolute calibration for allocation | The optimizer uses ordering at the margin | Prioritize monotonic ranking diagnostics |
| 9 | AUC-like metrics can be insufficient | Shape metrics may not track policy value under logged propensities | Use IPS/DR policy-value criteria for model selection |
| 10 | Exploration policy is part of the estimator | Skewed assignment changes sample geometry | Log propensities and use IPW correction |
| 11 | One global model can be worse early | Lifecycle regimes differ | Use lifecycle-specific models until pooled evidence is strong |
| 12 | Objective mismatch creates hidden regressions | Training metric and allocation objective diverge | Align scoring objective with deployment economics |
| 13 | Static thresholds decay over time | Environment and offer mix drift | Recalibrate regularly and maintain continuous evaluation |
| 14 | No-offer is a first-class action | Forced treatment creates spend leakage | Include no-offer in the action space |
| 15 | Data generation is an optimization problem too | Random exploration can be expensive and slow | Use warm start plus correction |
| 16 | Eligibility design is itself a model | The wrong candidate pool limits every optimizer | Iterate eligibility boundaries as part of policy design |
| 17 | Counterfactual logging quality sets the ceiling | Missing or inconsistent logs break OPE | Treat logging schema as core ML infrastructure |
| 18 | Feature parity matters across train and serve | Offline-online mismatch degrades policy | Enforce feature contracts and monitoring |
One-page methodology
The production loop.
Identify -> Learn -> Decide -> Validate
The methodological contribution is not a single model class. It is the integration of causal identification, uplift estimation, propensity-aware policy evaluation, and constrained optimization with pacing. That integration turns incentives from campaign heuristics into an adaptive control system.
Appendix