Interview study guide

Causal Incentive Allocation

A technical guide to moving from heuristic growth campaigns to causal, constrained, ML-driven user-level allocation.

Public-safe version. Company-specific dollar values, exact thresholds, internal code names, and private experiment counts are intentionally omitted. The methodology, equations, and operating lessons are preserved.

Core claim

This is a causal decision system, not better targeting.

The shift is from rules that select likely converters to a system that estimates incremental value, estimates incremental cost, and allocates incentives under economic constraints.

1. Identify incrementality

Use randomized assignment and clear outcome definitions so observed behavior can be interpreted causally.

2. Learn heterogeneous effects

Estimate user-action treatment effects separately for value and cost, then calibrate and stress-test rankings.

3. Optimize allocation

Convert predictions into user-level assignments that maximize expected incremental value under budget and efficiency guardrails.

4. Keep learning valid

Preserve exploration, log propensities, evaluate candidate policies offline, then validate online.

Interview framing: a heuristic coupon system asks who is likely to order. A causal incentive system asks where spending changes behavior enough to justify the cost.

Starting point

Why heuristic campaigns break down.

A typical pre-ML baseline is a lifecycle rule: if a user has crossed an inactivity threshold, send a fixed reactivation offer. That rule is easy to explain but structurally limited.

No counterfactual

It pays some users who would have returned without an incentive.

No depth personalization

It cannot decide whether a smaller, larger, or no offer is best for this specific user.

No portfolio intelligence

It cannot reliably translate a spend target into the best set of user-offer assignments.

Wrong objective

Predicting conversion propensity is not the same as estimating incremental lift per unit cost.

Formal problem

Estimate value lift, estimate cost lift, then optimize.

For each eligible user-action pair \((x,a)\), where \(x\) is the user context and \(a\) is the candidate action, estimate incremental value:

\[ \tau^v(x,a) = \mathbb{E}[Y^v(a)-Y^v(0)\mid X=x] \]

Estimate incremental cost:

\[ \tau^c(x,a) = \mathbb{E}[Y^c(a)-Y^c(0)\mid X=x] \]

One useful efficiency convention is value per incremental cost, where higher is better:

\[ \mathrm{CPIV}(x,a)=\frac{\hat{\tau}^{v}(x,a)}{\hat{\tau}^{c}(x,a)} \]

Another convention is cost per incremental value, useful when a lower-is-better ranking is preferred:

\[ \widehat{CPI}(x,a)=\frac{\hat{\tau}^{c}(x,a)}{\hat{\tau}^{v}(x,a)} \]
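
The two ratios above are reciprocals, so they differ mainly in which estimate sits in the denominator and therefore where the ratio becomes undefined. The sketch below is illustrative only: the function names and the choice to return `None` for a non-positive denominator are assumptions, not part of the original method.

```python
from typing import Optional


def value_per_cost(tau_v: float, tau_c: float) -> Optional[float]:
    """Value per unit of incremental cost (higher is better).

    Assumption: a non-positive cost estimate makes the ratio meaningless,
    so the caller gets None and must handle that pair separately.
    """
    if tau_c <= 0:
        return None
    return tau_v / tau_c


def cost_per_value(tau_v: float, tau_c: float) -> Optional[float]:
    """Cost per unit of incremental value (lower is better).

    Assumption: a non-positive value estimate makes the ratio meaningless.
    """
    if tau_v <= 0:
        return None
    return tau_c / tau_v


# Illustrative estimates for one user-action pair (made-up numbers).
example_value_per_cost = value_per_cost(tau_v=3.0, tau_c=1.5)
example_cost_per_value = cost_per_value(tau_v=3.0, tau_c=1.5)
```

Whichever convention is used, it helps to fix it once, because a sort in the wrong direction silently inverts the ranking.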

The assignment problem is a constrained binary optimization:

\[ \max_{S_{x,a}\in\{0,1\}}\sum_{x,a}S_{x,a}\hat{\tau}^{v}(x,a) \quad\mathrm{s.t.}\quad \sum_{x,a}S_{x,a}\hat{\tau}^{c}(x,a)\le B \]

Additional constraints encode guardrails: treatment volume, marginal efficiency, eligibility, channel limits, and pacing.
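
One simple way to approximate the budget-constrained problem above is a greedy pass in descending order of value per cost. This is a heuristic sketch, not an exact solver and not necessarily the optimizer used in production. The `Candidate` type, the at-most-one-action-per-user rule, and the restriction to strictly positive value and cost estimates are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    user: str
    action: str
    tau_v: float  # estimated incremental value
    tau_c: float  # estimated incremental cost


def greedy_allocate(candidates: List[Candidate], budget: float) -> Tuple[List[Candidate], float]:
    """Greedy approximation: take pairs in descending value-per-cost order.

    Assumptions: only pairs with tau_v > 0 and tau_c > 0 are considered,
    and each user receives at most one action.
    """
    usable = [c for c in candidates if c.tau_v > 0 and c.tau_c > 0]
    ranked = sorted(usable, key=lambda c: c.tau_v / c.tau_c, reverse=True)

    chosen: List[Candidate] = []
    spent = 0.0
    treated_users = set()
    for c in ranked:
        if c.user in treated_users:
            continue  # this user already has an assignment
        if spent + c.tau_c > budget:
            continue  # would break the budget constraint
        chosen.append(c)
        spent += c.tau_c
        treated_users.add(c.user)
    return chosen, spent


# Made-up candidates for illustration.
example_candidates = [
    Candidate("u1", "small", 2.0, 1.0),
    Candidate("u1", "large", 3.0, 2.0),
    Candidate("u2", "small", 1.0, 1.0),
    Candidate("u3", "large", 4.0, 5.0),
    Candidate("u4", "small", -0.5, 1.0),
]
example_chosen, example_spent = greedy_allocate(example_candidates, budget=3.0)
```

Greedy selection can leave budget unused when a large item does not fit, which is one reason an exact integer-programming formulation may be preferable when the guardrails are tight.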

Identification

Causal labels come before modeling.

If assignment follows historical targeting rules, observed outcomes confound user intent with treatment effect. Randomization and logging define what can be learned.

Identification map

[Diagram: Randomization, Propensity logging, Estimand identification, IPS / DR validity, Interference controls]

Binary treatment per campaign variant

In the simplest setting, define eligibility, randomize send versus withhold, and estimate conditional average treatment effects:

\[ \tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x] \]
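
Under randomized send versus withhold, the simplest estimate of \(\tau(x)\) for a coarse segment is the treated mean minus the control mean within that segment. The sketch below assumes a constant assignment probability inside each segment; the row layout and segment names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, bool, float]  # (segment, treated, outcome)


def cate_by_segment(rows: Iterable[Row]) -> Dict[str, float]:
    """Difference in means per segment.

    Assumption: within a segment every user had the same probability of
    treatment, so the raw difference in means is a valid estimate.
    """
    # Per segment: [treated sum, treated count, control sum, control count]
    acc = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for segment, treated, outcome in rows:
        cell = acc[segment]
        if treated:
            cell[0] += outcome
            cell[1] += 1
        else:
            cell[2] += outcome
            cell[3] += 1

    effects = {}
    for segment, (t_sum, t_n, c_sum, c_n) in acc.items():
        if t_n and c_n:  # both arms are needed for a contrast
            effects[segment] = t_sum / t_n - c_sum / c_n
    return effects


# Made-up randomized log for illustration.
example_rows = [
    ("lapsed", True, 1.0), ("lapsed", True, 0.0),
    ("lapsed", False, 0.0), ("lapsed", False, 0.0),
    ("active", True, 1.0), ("active", False, 1.0),
]
example_effects = cate_by_segment(example_rows)
```

In this toy log the "active" segment converts at the same rate with or without the offer, which is the pattern a propensity-to-convert model would miss.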

Multi-action treatment

For richer in-app actions, the policy action space includes many possible treatments plus no-offer:

\[ a\in\mathcal{A}\cup\{\mathrm{no\ offer}\} \]

Each logged decision needs the probability of the action that was taken. Those propensities are mandatory for off-policy evaluation.
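
A minimal sketch of propensity logging at decision time follows. The record fields, the action names, and the `sample_and_log` helper are hypothetical; the point is only that the probability of the chosen action is written at the moment the action is sampled, not reconstructed later.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LoggedDecision:
    user_id: str
    action: str
    propensity: float  # probability of the action that was actually taken


def sample_and_log(user_id: str, action_probs: Dict[str, float], rng: random.Random) -> LoggedDecision:
    """Sample one action from the exploration policy and record its probability."""
    if any(p < 0 for p in action_probs.values()):
        raise ValueError("action probabilities must be non-negative")
    if not math.isclose(sum(action_probs.values()), 1.0, abs_tol=1e-9):
        raise ValueError("action probabilities must sum to 1")
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    action = rng.choices(actions, weights=weights, k=1)[0]
    return LoggedDecision(user_id=user_id, action=action, propensity=action_probs[action])


# Hypothetical exploration policy over three actions, including no-offer.
example_probs = {"no_offer": 0.5, "small": 0.3, "large": 0.2}
example_record = sample_and_log("u1", example_probs, random.Random(0))
```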

Identification pitfalls

  • Action-space explosion: many action cells thin out per-cell support and raise HTE variance.
  • Long-horizon labels: conceptually appealing windows can be too noisy for learning stable uplift rankings.
  • Interference: concurrent incentives can violate isolation assumptions and contaminate per-program lift.

Warm-start exploration

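Exploration can be skewed toward high-value, low-cost regions to reduce data-generation waste. Because skewed exploration changes the sample geometry, population-level estimates must be reweighted by inverse propensity. The self-normalized form below is consistent rather than exactly unbiased, and it usually has lower variance than the unnormalized estimator: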

\[ \hat{\bar{Y}}= \frac{\sum_i Y_i/\pi_i(W_i)} {\sum_i 1/\pi_i(W_i)} \]
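
The weighted mean above can be computed directly from logged outcomes and logged propensities. The sketch below uses made-up numbers in which high-outcome users were oversampled, so the naive mean overstates the population mean.

```python
from typing import Sequence


def ipw_mean(outcomes: Sequence[float], propensities: Sequence[float]) -> float:
    """Self-normalized inverse-propensity-weighted mean.

    propensities[i] is the logged probability of what actually happened
    to unit i under the exploration policy.
    """
    if len(outcomes) != len(propensities):
        raise ValueError("outcomes and propensities must have equal length")
    if any(p <= 0 for p in propensities):
        raise ValueError("propensities must be strictly positive")
    weights = [1.0 / p for p in propensities]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Made-up log: the two high-outcome units were sampled with probability 0.8,
# the low-outcome unit with probability 0.2.
example_outcomes = [10.0, 10.0, 2.0]
example_propensities = [0.8, 0.8, 0.2]
example_naive = sum(example_outcomes) / len(example_outcomes)
example_weighted = ipw_mean(example_outcomes, example_propensities)
```

Here the weighted estimate is pulled toward the undersampled low-outcome unit, which is the correction the skewed exploration design requires.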

Uplift modeling

Model the effect, not the outcome.

Model decomposition

[Diagram: Features -> Value uplift model + Cost uplift model -> Score ratio / objective]

Why an X-Learner style architecture

  • It handles treatment-control imbalance well.
  • It estimates heterogeneous effects directly.
  • It often outperforms simpler baselines in incentive environments where treatment effects vary sharply across users.

Per-variant modeling

  1. Bootstrap training data across multiple resamples.
  2. Train an uplift learner per resample.
  3. Aggregate CATE predictions across bootstraps.
  4. Maintain separate heads for incremental cost and incremental value.
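
The four steps above can be sketched with a deliberately simple stand-in learner. The code below is not an X-Learner: it substitutes a per-segment difference in means so the bootstrap-and-aggregate pattern stays visible. The function names, row layout, and data are hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Tuple

Row = Tuple[str, bool, float]  # (segment, treated, outcome)


def fit_segment_uplift(rows: List[Row]) -> Dict[str, float]:
    """Stand-in uplift learner: treated mean minus control mean per segment."""
    groups: Dict[Tuple[str, bool], List[float]] = {}
    for segment, treated, outcome in rows:
        groups.setdefault((segment, bool(treated)), []).append(outcome)
    segments = {segment for segment, _ in groups}
    return {
        s: mean(groups[(s, True)]) - mean(groups[(s, False)])
        for s in segments
        if (s, True) in groups and (s, False) in groups
    }


def bagged_uplift(rows: List[Row], n_resamples: int, seed: int = 0) -> Dict[str, float]:
    """Steps 1-3: resample with replacement, fit per resample, average predictions."""
    rng = random.Random(seed)
    collected: Dict[str, List[float]] = {}
    for _ in range(n_resamples):
        resample = [rng.choice(rows) for _ in rows]
        for segment, tau in fit_segment_uplift(resample).items():
            collected.setdefault(segment, []).append(tau)
    return {segment: mean(taus) for segment, taus in collected.items()}


# Step 4: separate heads, here as two separate label columns (made-up data).
value_rows = [("lapsed", True, 1.0)] * 10 + [("lapsed", False, 0.0)] * 10
cost_rows = [("lapsed", True, 0.5)] * 10 + [("lapsed", False, 0.0)] * 10
value_head = bagged_uplift(value_rows, n_resamples=25)
cost_head = bagged_uplift(cost_rows, n_resamples=25)
```

The spread of per-resample estimates, which this sketch discards, is also a cheap stability diagnostic.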

Feature strategy

Feature sets should respect lifecycle regimes. Activation, habituation, and resurrection have different behavioral dynamics, so an early one-size-fits-all model can blur treatment effects.

Hyperparameter selection

Choose models by incremental calibration, ranking monotonicity in holdout bins, and stability across splits. The practical question is: does this model produce a reliable ordering for decisioning?
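
One way to check ranking monotonicity in holdout bins is to sort a randomized holdout by predicted uplift, split it into equal-sized bins, and compare realized uplift across bins. The sketch below assumes a randomized holdout with constant treatment probability; the helper names and data are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple

HoldoutRow = Tuple[float, bool, float]  # (predicted uplift, treated, outcome)


def realized_uplift_by_bin(rows: Sequence[HoldoutRow], n_bins: int) -> List[Optional[float]]:
    """Realized treated-minus-control mean in each predicted-uplift bin, low to high."""
    ordered = sorted(rows, key=lambda r: r[0])
    n = len(ordered)
    result: List[Optional[float]] = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        treated = [y for _, t, y in chunk if t]
        control = [y for _, t, y in chunk if not t]
        # A bin without both arms has no contrast to report.
        result.append(mean(treated) - mean(control) if treated and control else None)
    return result


def is_monotone_increasing(values: Sequence[Optional[float]]) -> bool:
    """True when realized uplift never decreases as predicted uplift rises."""
    known = [v for v in values if v is not None]
    return all(a <= b for a, b in zip(known, known[1:]))


# Made-up holdout for illustration.
example_holdout = [
    (0.1, True, 1.0), (0.1, False, 1.0), (0.2, True, 1.0), (0.2, False, 1.0),
    (0.8, True, 3.0), (0.8, False, 1.0), (0.9, True, 2.0), (0.9, False, 1.0),
]
example_bins = realized_uplift_by_bin(example_holdout, n_bins=2)
example_is_monotone = is_monotone_increasing(example_bins)
```

Running the same check on several splits gives a rough view of stability as well as monotonicity.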

Why value is harder than cost

Cost labels are usually denser and easier to calibrate. Value uplift can be lower signal-to-noise, especially when outcomes are sparse or delayed. That motivates separate heads, heavier value diagnostics, and shorter-horizon value proxies for ranking.

Labels and horizons

Learning horizon and accounting horizon are different objects.

Cost label

Incremental in-period incentive cost, credit deposited, or realized redemption depending on the channel and finance definition.

Value label

Short-horizon proxies for stable ranking, plus longer-horizon business metrics for reporting and planning.

A decision score can be expressed as a net present value proxy:

\[ \mathrm{score}_a(x)=\widehat{\Delta V}(x,a)-\lambda\cdot\widehat{\mathrm{credit}}(x,a) \]

Training windows should match intervention dynamics. If redemption and behavioral effects are short-cycle, forcing a long noisy window into the learning objective can weaken uplift ordering.

Decision layer

Predictions become assignments through pruning, ranking, optimization, and pacing.

Pruning

Remove user-action pairs that are economically invalid before the optimizer sees them:

\[ \hat{\tau}^{v}(x,a)<0 \]

Also remove pairs whose cost-per-incremental-value exceeds guardrails.

Ranking

Rank survivors by incremental efficiency. Cost-only ordering is insufficient; the ranking must be value-aware.
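
Pruning and ranking can be sketched as one pass over scored pairs. The tuple layout, the `max_cost_per_value` guardrail name, and the decision to also drop pairs with exactly zero estimated value (where the ratio is undefined) are assumptions made for illustration.

```python
from typing import List, Tuple

Pair = Tuple[str, str, float, float]  # (user, action, tau_v, tau_c)


def prune_and_rank(pairs: List[Pair], max_cost_per_value: float) -> List[Pair]:
    """Drop economically invalid pairs, then sort by cost per incremental value."""
    survivors = []
    for pair in pairs:
        _, _, tau_v, tau_c = pair
        if tau_v <= 0:
            continue  # no positive incremental value to buy
        if tau_c / tau_v > max_cost_per_value:
            continue  # too expensive per unit of incremental value
        survivors.append(pair)
    # Lower cost per value is better, so sort ascending.
    return sorted(survivors, key=lambda p: p[3] / p[2])


# Made-up scored pairs for illustration.
example_pairs = [
    ("u1", "large", 2.0, 1.5),
    ("u2", "small", 2.0, 1.0),
    ("u3", "large", 1.0, 3.0),
    ("u4", "small", -0.2, 1.0),
]
example_ranked = prune_and_rank(example_pairs, max_cost_per_value=1.0)
```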

Optimization

  • Maximize predicted incremental value.
  • Obey budget from the pacer.
  • Obey marginal efficiency and eligibility guardrails.

Useful phrasing: the business chooses spend; the system chooses allocation.

Budget pacing

Models estimate liability, while finance controls realized in-period spend. The pacer adjusts model output using timing calibration and bias calibration:

\[ B_{\mathrm{today}}= B_{\mathrm{period}}- (\mathrm{actual\ spend\ to\ date}+\mathrm{expected\ remaining\ liability}) \]
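
The pacing identity above is simple arithmetic once the remaining liability has been calibrated. In the sketch below, the multiplicative form of the timing and bias corrections, the parameter names, and the floor at zero are all assumptions; the source states only that both calibrations exist.

```python
def expected_remaining_liability(
    predicted_outstanding: float,
    bias_ratio: float,
    in_period_share: float,
) -> float:
    """Hypothetical calibration of model-predicted liability.

    bias_ratio: historical realized spend divided by predicted spend.
    in_period_share: fraction of outstanding liability expected to be
    redeemed before the period ends (the timing correction).
    """
    return predicted_outstanding * bias_ratio * in_period_share


def budget_today(
    period_budget: float,
    actual_spend_to_date: float,
    remaining_liability: float,
) -> float:
    """Budget still available to allocate, floored at zero (an assumption)."""
    return max(0.0, period_budget - (actual_spend_to_date + remaining_liability))


# Made-up numbers for illustration.
example_liability = expected_remaining_liability(300.0, bias_ratio=0.9, in_period_share=0.5)
example_budget = budget_today(1000.0, actual_spend_to_date=400.0, remaining_liability=example_liability)
```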

Budget pacing concept

[Diagram: Predicted liability, Actual spend, Timing correction, Bias correction, Budget, Optimizer]

Off-policy evaluation

Evaluate candidate policies before shipping them.

For each user, score actions including no-offer:

\[ a^*(x)=\arg\max_a\widehat{\mathrm{score}}_a(x) \]

Then apply cutoffs to control volume and efficiency. Before online deployment, estimate candidate policy value on logged randomized data.
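
A minimal sketch of the per-user action rule with a cutoff follows. It assumes scores are incremental relative to no-offer, so no-offer has a score of zero by convention; the `score` helper mirrors the net-value form given earlier, and the action names and `min_score` cutoff are hypothetical.

```python
from typing import Dict

NO_OFFER = "no_offer"


def score(delta_value: float, credit: float, lam: float) -> float:
    """Predicted incremental value minus lambda times predicted credit."""
    return delta_value - lam * credit


def choose_action(offer_scores: Dict[str, float], min_score: float = 0.0) -> str:
    """Pick the best-scoring offer, falling back to no-offer.

    Assumption: an offer must strictly beat min_score (at least the
    zero score of no-offer) to be assigned.
    """
    if not offer_scores:
        return NO_OFFER
    best = max(offer_scores, key=offer_scores.get)
    return best if offer_scores[best] > min_score else NO_OFFER


# Made-up predictions for one user.
example_scores = {
    "small": score(delta_value=1.0, credit=0.5, lam=1.0),
    "large": score(delta_value=1.2, credit=1.5, lam=1.0),
}
example_default = choose_action(example_scores)
example_strict = choose_action(example_scores, min_score=0.6)
```

Raising `min_score` is one way to trade treatment volume for efficiency without retraining the models.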

Inverse propensity scoring

\[ \hat{V}_{\mathrm{IPS}}(\pi)= \frac{1}{n}\sum_i \frac{\mathbb{1}\{W_i=\pi(X_i)\}}{\hat{p}(W_i\mid X_i)} Y_i \]
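
The IPS estimator translates almost line for line into code. The log layout and the toy policy below are hypothetical; the log is assumed to come from uniform randomization over two actions, so every propensity is 0.5.

```python
from typing import Callable, List, Tuple

LogRow = Tuple[str, str, float, float]  # (context, logged action, propensity, outcome)


def ips_value(logs: List[LogRow], policy: Callable[[str], str]) -> float:
    """Inverse propensity scoring estimate of the candidate policy's value."""
    total = 0.0
    for context, logged_action, propensity, outcome in logs:
        # Only rows where the candidate policy agrees with the log contribute.
        if policy(context) == logged_action:
            total += outcome / propensity
    return total / len(logs)


def example_policy(context: str) -> str:
    """Toy candidate policy: offer only to lapsed users."""
    return "offer" if context == "lapsed" else "no_offer"


# Made-up randomized log for illustration.
example_logs = [
    ("lapsed", "offer", 0.5, 1.0),
    ("lapsed", "no_offer", 0.5, 0.0),
    ("active", "offer", 0.5, 1.0),
    ("active", "no_offer", 0.5, 1.0),
]
example_ips = ips_value(example_logs, example_policy)
```

Because only matching rows contribute, small propensities on matched rows inflate variance, which is the usual motivation for the doubly robust form.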

Doubly robust estimator

\[ \hat{V}_{\mathrm{DR}}(\pi)= \frac{1}{n}\sum_i \left[ \hat{\mu}(\pi(X_i),X_i) + \frac{\mathbb{1}\{W_i=\pi(X_i)\}}{\hat{p}(W_i\mid X_i)} \left(Y_i-\hat{\mu}(W_i,X_i)\right) \right] \]
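
The doubly robust estimator adds an outcome-model baseline to every row and an inverse-propensity correction to the rows where the candidate policy matches the logged action. In the sketch below the outcome model `example_mu` is a deliberately crude constant per action, and the log and policy are made up.

```python
from typing import Callable, List, Tuple

LogRow = Tuple[str, str, float, float]  # (context, logged action, propensity, outcome)


def dr_value(
    logs: List[LogRow],
    policy: Callable[[str], str],
    mu: Callable[[str, str], float],
) -> float:
    """Doubly robust estimate; mu(action, context) is the fitted outcome model."""
    total = 0.0
    for context, logged_action, propensity, outcome in logs:
        target_action = policy(context)
        term = mu(target_action, context)  # model-based baseline
        if target_action == logged_action:
            # Correct the baseline with the observed residual.
            term += (outcome - mu(logged_action, context)) / propensity
        total += term
    return total / len(logs)


def example_policy(context: str) -> str:
    """Toy candidate policy: offer only to lapsed users."""
    return "offer" if context == "lapsed" else "no_offer"


def example_mu(action: str, context: str) -> float:
    """Crude stand-in outcome model that ignores context."""
    return 0.8 if action == "offer" else 0.5


example_logs = [
    ("lapsed", "offer", 0.5, 1.0),
    ("lapsed", "no_offer", 0.5, 0.0),
    ("active", "offer", 0.5, 1.0),
    ("active", "no_offer", 0.5, 1.0),
]
example_dr = dr_value(example_logs, example_policy, example_mu)
```

If the outcome model is set to zero everywhere, the expression reduces to IPS, which is a convenient sanity check for an implementation.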

Policy-value metrics from propensity-aware estimators are often more decision-relevant than uplift shape metrics alone, because they evaluate the actual action rule under deployment-like constraints.

Portfolio effects

Interference is structure, not noise.

When multiple incentive systems run concurrently, measured treatment effects can shift because another program also affects outcomes. Holdout interpretation becomes conditional on the current portfolio state. Per-program lift can be biased if cross-program assignment is not logged and modeled.

  1. Log all assignment surfaces.
  2. Include competing-treatment features where possible.
  3. Partition audiences when stacking control is unavailable.
  4. Move toward coordinated experimentation across systems.

Evaluation stack

What is good enough to ship?

Evaluation pipeline

Label calibration -> Ranking separation -> Optimization sanity -> OPE policy value -> Online validation

Label calibration

Bin-level predicted versus actual checks, separately for cost and value.

Ranking validity

Lower predicted cost per value should correspond to better realized efficiency; ordering should be monotonic and stable.

Optimization realism

Assignments must respect budget, pruning rules, volume limits, and marginal guardrails under edge cases.

Policy value

Use IPS/DR to compare the candidate policy against the baseline policy, on the same outcomes the allocation objective optimizes, before online validation.

Action-space lesson

Large action spaces fail by starving each cell of evidence.

A very large crossed action experiment can underperform even when the idea is right. With fixed total sample, each treatment cell gets thinner support, so heterogeneous treatment effect estimates become noisy and unstable.
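
The thinning effect can be made concrete with the standard error of a difference in means. The sketch below assumes a shared control group, equal outcome variance in every arm, and an even split of treated users across cells; all numbers are made up for illustration.

```python
import math


def se_uplift(sigma: float, n_treated: int, n_control: int) -> float:
    """Standard error of treated-minus-control means with common outcome sd sigma."""
    return sigma * math.sqrt(1.0 / n_treated + 1.0 / n_control)


def per_cell_se(total_n: int, n_cells: int, control_share: float, sigma: float) -> float:
    """Per-cell standard error when treated users are split evenly across cells."""
    n_control = int(total_n * control_share)
    n_per_cell = (total_n - n_control) // n_cells
    return se_uplift(sigma, n_per_cell, n_control)


# Same total sample, compact versus large action set.
example_compact = per_cell_se(total_n=100_000, n_cells=2, control_share=0.5, sigma=1.0)
example_large = per_cell_se(total_n=100_000, n_cells=50, control_share=0.5, sigma=1.0)
example_ratio = example_large / example_compact
```

This covers only the cell-level average effect. Heterogeneous effects within a cell split that support further, so the practical penalty is larger than the ratio suggests.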

Replacement strategy

  1. Start with a compact, interpretable action set.
  2. Identify strong action families.
  3. Expand locally around promising regions.
  4. Re-estimate the policy with updated propensities.

This preserves learning velocity while maintaining identifiability.

Lessons

Design rules for production causal allocation.

# | Lesson | Mechanism | Design rule
1 | Long-horizon effects are hard to identify | Sparse, noisy incremental signal over long windows | Train on shorter-horizon proxies for ranking; keep longer horizons for reporting
2 | Large action spaces fail fast | Per-cell sample support collapses | Start compact, then expand locally
3 | Cost modeling is usually easier than value uplift | Denser labels and lower variance | Separate model heads and calibration workflows
4 | Campaign averages hide actionable heterogeneity | Tail users can be efficient even if the campaign mean is not | Optimize at user-offer level
5 | Non-compliance matters | Assigned action can differ from realized action | Use logged propensities and realized treatment for OPE
6 | Interference is structure, not noise | Concurrent incentives shift outcomes | Log all surfaces and model overlap explicitly
7 | Delayed redemption breaks naive budgeting | Liability timing differs from assignment timing | Add pacing with timing and bias calibration
8 | Ranking quality matters more than absolute calibration for allocation | The optimizer uses ordering at the margin | Prioritize monotonic ranking diagnostics
9 | AUC-like metrics can be insufficient | Shape metrics may not track policy value under logged propensities | Use IPS/DR policy-value criteria for model selection
10 | Exploration policy is part of the estimator | Skewed assignment changes sample geometry | Log propensities and use IPW correction
11 | One global model can be worse early | Lifecycle regimes differ | Use lifecycle-specific models until pooled evidence is strong
12 | Objective mismatch creates hidden regressions | Training metric and allocation objective diverge | Align scoring objective with deployment economics
13 | Static thresholds decay over time | Environment and offer mix drift | Recalibrate regularly and maintain continuous evaluation
14 | No-offer is a first-class action | Forced treatment creates spend leakage | Include no-offer in the action space
15 | Data generation is an optimization problem too | Random exploration can be expensive and slow | Use warm start plus correction
16 | Eligibility design is itself a model | The wrong candidate pool limits every optimizer | Iterate eligibility boundaries as part of policy design
17 | Counterfactual logging quality sets the ceiling | Missing or inconsistent logs break OPE | Treat logging schema as core ML infrastructure
18 | Feature parity matters across train and serve | Offline-online mismatch degrades policy | Enforce feature contracts and monitoring

One-page methodology

The production loop.

Identify -> Learn -> Decide -> Validate

Identify: Randomized assignment; Propensity logging; Outcome definitions
Learn: Uplift models; Cost/value split; Calibration + stability
Decide: Prune bad edges; Rank by efficiency; Optimize + pace
Validate: IPS / DR OPE; Online tests; Drift checks

The methodological contribution is not a single model class. It is the integration of causal identification, uplift estimation, propensity-aware policy evaluation, and constrained optimization with pacing. That integration turns incentives from campaign heuristics into an adaptive control system.

Appendix

Equation block.

CATE
\[ \tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x] \]
CPIV and CPI
\[ \mathrm{CPIV}(x,a)=\frac{\hat{\tau}^{v}(x,a)}{\hat{\tau}^{c}(x,a)}, \qquad \widehat{CPI}(x,a)=\frac{\hat{\tau}^{c}(x,a)}{\hat{\tau}^{v}(x,a)} \]
Constrained optimization (indexed by period \(t\))
\[ \max_{\{S_t(x,a)\in\{0,1\}\}} \sum_{x,a}S_t(x,a)\hat{\tau}^{v}(x,a) \quad\mathrm{s.t.}\quad \sum_{x,a}S_t(x,a)\hat{\tau}^{c}(x,a)\le B_t \]
IPS
\[ \hat{V}_{\mathrm{IPS}}(\pi)= \frac{1}{n}\sum_i \frac{\mathbb{1}\{W_i=\pi(X_i)\}}{\hat{p}(W_i\mid X_i)}Y_i \]
DR
\[ \hat{V}_{\mathrm{DR}}(\pi)= \frac{1}{n}\sum_i \left[ \hat{\mu}(\pi(X_i),X_i) + \frac{\mathbb{1}\{W_i=\pi(X_i)\}}{\hat{p}(W_i\mid X_i)} \left(Y_i-\hat{\mu}(W_i,X_i)\right) \right] \]