Forecasting Architecture

Temporal Fusion Transformer

A Temporal Fusion Transformer, usually shortened to TFT, is a neural architecture for multi-horizon time-series forecasting. It is built for the messy forecasting cases that show up in production: multiple related entities, many covariates, known future inputs like calendars or prices, static metadata like store or ticker identity, nonlinear interactions, regime changes, and the need to explain which inputs mattered.

The Short Version

TFT is a supervised deep-learning model that takes a lookback window and produces forecasts for several future horizons. Its main trick is not just "use attention." Its main trick is architectural discipline: separate the information into static features, past-observed time-varying features, and future-known time-varying features, then let each kind of information enter the model through the path where it is most useful.

In plain English: TFT tries to answer three questions at once. What kind of entity is this? What has been happening recently? What do we already know about the future? It then decides which inputs matter, models short-term sequential behavior, uses attention to choose the important historical time points, and emits a full uncertainty-aware forecast.
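To make the three-way split concrete, here is a minimal sketch with illustrative shapes and names, not any particular library's API:

```python
import numpy as np

# Minimal sketch of the three input types a TFT consumes. Shapes and names
# are illustrative assumptions, not from any specific implementation.
lookback, horizon = 28, 5          # encoder steps, forecast steps
n_past, n_future, n_static = 12, 4, 3

static_feats = np.zeros(n_static)              # entity metadata, constant over time
past_feats   = np.zeros((lookback, n_past))    # observed only up to the snapshot
future_feats = np.zeros((horizon, n_future))   # known ahead of time (calendar, plans)

# The availability rule the architecture enforces: past_feats may include the
# target history and realized covariates; future_feats may only include
# information known by contract at decision time.
forecast_quantiles = np.zeros((horizon, 3))    # e.g. p10 / p50 / p90 per horizon
```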

The architecture was introduced in "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" by Lim et al. It became influential because it combined several useful ideas that had often been handled separately: covariate-aware forecasting, local recurrent sequence modeling, transformer-style temporal attention, static conditioning, feature selection, gating, residual learning, and quantile loss.

Drawn Architecture

This diagram shows the high-level data flow. The exact tensor shapes vary by implementation, but the conceptual paths are stable: static metadata conditions the network; past and future covariates pass through variable selection; LSTMs encode local sequence behavior; attention decides which encoded time points matter for each forecast horizon; the output head emits quantiles.

Figure: Temporal Fusion Transformer, from raw covariates to multi-horizon quantiles. Five stages run left to right: inputs, variable selection, sequential modeling, attention, and forecast output. Static inputs (entity id, sector, geography, store type, ticker metadata) feed a static covariate encoder that creates context vectors for variable selection, LSTM state initialization, enrichment, and gating. Past observed inputs (target history y[t-k:t], observed covariates, lagged signals, realized state) and future known inputs (calendar, price plans, events, scheduled actions, forecast-horizon markers) each pass through their own variable selection network. An LSTM encoder reads the selected past sequence to capture local order, recency, and short-run dynamics; an LSTM decoder rolls through future-known covariates for horizons t+1 ... t+H. Gated residual networks skip, suppress, or refine nonlinear transforms, and static enrichment injects entity context into each temporal representation before interpretable multi-head attention decides which historical time points matter for each horizon. Quantile heads emit p10, p50, and p90 forecasts for every horizon, producing the forecast distribution y_hat[t+1:t+H, q]. Solid arrows are main data flow; dashed arrows are conditioning and gating paths from static context; attention weights are inspectable across time.

How We Use It In Pelican

In Pelican, the TFT is not a generic stock-price oracle. It is a sequence model used inside a defined-risk options decision system. The active production family is tft_classical_sequence_v1. Its job is to rescore concrete option-spread candidates after the system has built current market features, generated candidate spreads, and checked the candidate's own quote and structure details.

The important design choice is candidate-level forecasting. A row is not just "AAPL at 10:00." It is closer to "this ticker, this timestamp, this bull-put or bear-call structure, this expiry, these strikes, this width, this quote, this liquidity, and this historical context." That makes the forecast directly usable for ranking and sizing trades.
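As a sketch of what one such row carries, with field names echoing the features described later but otherwise hypothetical:

```python
from dataclasses import dataclass

# Illustrative shape of one candidate-level training/inference row. Field
# names follow the features described in this article but are hypothetical here.
@dataclass
class CandidateRow:
    ticker: str            # "AAPL"
    snapshot_ts: str       # decision-time snapshot, e.g. "2026-02-11T21:00Z"
    structure: str         # "bull_put" or "bear_call"
    expiry: str            # option expiration date
    candidate_dte: int     # days to expiry at decision time
    short_strike: float
    long_strike: float
    width: float           # abs(short_strike - long_strike)
    entry_credit_mid: float
    spread_bid_ask_width: float
    candidate_quote_age_seconds: float
    sequence_id: str       # link to the 28-step encoder sequence
```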

1. Snapshot: market state arrives

Price, option chains, volume, open interest, implied distributions, calendar state, and news-derived state.

2. Sequence: 28 observed steps

The encoder sees a recent sequence of market snapshots, including freshness and missingness signals.

3. Candidate: spread is specified

Structure, expiry, strikes, width, credit, Greeks, moneyness, bid-ask, quote age, OI, and volume.

4. Forecast: TFT + GBM distribution

The TFT forecast is blended with a GBM quantile ensemble through a calibrated CDF mixture.

5. Decision: edge after costs

The distribution turns into candidate EV, tail risk, p20 sizing edge, rank, and gate-visible no-trade reasons.

The model does not authorize trades by itself. It feeds a post-cost decision path: candidate generation, calibrated distribution pricing, liquidity checks, duplicate checks, risk limits, broker-safe execution gates, and operator-visible reasons when no candidate is eligible.
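A toy, end-to-end rendering of those five steps, where every name and all the placeholder logic are illustrative assumptions rather than the production code:

```python
# Toy version of the five-step path above. All names and values are made up.
def build_sequence(history, steps=28):
    return history[-steps:]                           # step 2: recent snapshots

def forecast(sequence, candidate):
    return {"p10": -0.05, "p50": 0.01, "p90": 0.06}   # step 4: stand-in quantiles

def decide(candidate, dist, min_edge=0.0):
    edge = dist["p50"] - candidate["cost"]            # step 5: toy post-cost edge
    return ("trade", edge) if edge > min_edge else ("no_trade", "edge<=min")

history = [{"close": 100 + i} for i in range(40)]     # step 1: market snapshots
candidate = {"structure": "bull_put", "cost": 0.004}  # step 3: spread candidate
seq = build_sequence(history)
print(decide(candidate, forecast(seq, candidate)))
```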

What The Training Data Looks Like

The current sequence-TFT training store is deliberately structured around the live decision contract. It has a sequence side, a candidate side, and an expiry-aligned underlying-return label for the candidate horizon. The no-leakage proof for the current store passes, which matters more than model cleverness in this kind of system.

  • 3,105 encoder sequences
  • 86,605 sequence time steps
  • 4,969 candidate rows and training rows
  • 28 observed steps per encoder sequence

Sequence features

The encoder sees market and options state through time: implied-distribution percentiles, IV rank, put-call ratios, open-interest and volume structure, spread quality, realized-volatility context, cross-sectional ranks, macro-calendar distances, day-of-week features, earnings flags, freshness indicators, missingness indicators, and a 64-dimensional news state.

Candidate features

Each candidate row describes the actual spread being considered: bull put or bear call, expiration, candidate DTE, short and long strikes, width, bid/mid/ask credit, bid-ask width, leg Greeks, IV, moneyness, distance to strikes, leg open interest, leg volume, quote age, and guard/funnel metadata.

Labels

The target is label_return: the underlying return from the decision snapshot to the candidate's expiry-aligned close. The spread candidate supplies the horizon, structure, strikes, width, and quote context used to turn a return distribution into payoff.

Leakage control

Features are split by what is observed at decision time versus what is known by contract, and the candidate store carries a no-leakage proof. If a feature is not available on the live path, it should not quietly enter training.

Operational Features We Actually Feed

The model input is intentionally unglamorous. Most of the value comes from forcing the same live decision facts into training and inference: what the option surface looked like, what exact spread was executable, how fresh the data was, which events were nearby, and whether the sequence had holes. Feature names matter because every name is a production contract.

Feature family | Concrete examples | Why it matters
Option surface and chain state | atm_iv, atm_iv_xrank, chain_impl_ret_p10, chain_impl_ret_p50, chain_impl_width, chain_pc_oi_ratio, chain_total_oi, chain_median_spread_pct | Shows what the options market is implying about distribution shape, liquidity, crowding, and transaction quality.
Candidate contract | structure, candidate_dte, short_strike, long_strike, width, entry_credit_mid, spread_bid_ask_width, net_delta, short_leg_oi, long_leg_volume | Turns the forecast from a generic ticker view into a decision about one executable bull-put or bear-call spread.
Event and calendar state | days_to_earnings, days_to_fomc, days_to_cpi, days_to_nfp, days_to_monthly_opex, dow_sin | Lets the model separate ordinary drift from known event-risk windows and recurring market-calendar effects.
News and text state | news_state_00 through news_state_63, plus news-state freshness fields | Condenses recent text context into stable numeric state without letting the model read future news.
Freshness and missingness | row_observed_mask, feature_observed_mask, feature_age_minutes, source_age_minutes, option_surface_age_minutes, candidate_quote_age_seconds | Prevents stale or absent data from looking like a confident numeric signal. Missingness is itself information.
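A minimal sketch of that last idea, assuming illustrative names: each raw value travels with an observed mask and an age, so absence and staleness stay visible to the model instead of silently becoming numbers.

```python
# Sketch of the freshness/missingness encoding. Names and the neutral-fill
# convention are illustrative assumptions.
def encode_feature(value, age_minutes, max_age=120.0):
    observed = value is not None
    return {
        "value": value if observed else 0.0,          # neutral fill, not a guess
        "observed_mask": 1.0 if observed else 0.0,    # missingness is a feature
        "age_frac": min(age_minutes / max_age, 1.0),  # staleness is a feature too
    }

print(encode_feature(0.2948, age_minutes=3.0))   # fresh atm_iv reading
print(encode_feature(None, age_minutes=45.0))    # missing reading, visibly missing
```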

Concrete AAPL sequence slice

One historical AAPL training example used a full 28-step encoder sequence ending near the 2026-02-11 close. The sequence covered snapshots from 2026-02-05 20:30 UTC through 2026-02-11 20:30 UTC, with no missing steps. The table below shows how a few real inputs changed from the first to the last observed step.

Feature | First observed step | Last observed step | Interpretation
atm_iv | 0.2948 | 0.2477 | At-the-money implied volatility compressed across the lookback window.
chain_impl_ret_p10 | -2.09% | -3.36% | The option-implied lower-return tail became more negative.
chain_pc_oi_ratio | 0.458 | 0.619 | Put-call open-interest balance moved toward more put weight.
chain_total_oi | 551,792 | 212,969 | Available open-interest state changed materially, so liquidity context changed too.
chain_median_spread_pct | 4.19% | 6.03% | The chain became more expensive to trade on median spread quality.
news_state_00 | 0.286 | 0.180 | The compressed news/text state changed, giving the sequence path non-price context.

What Predictions We Make

The useful output is a forecast distribution for the underlying return over the candidate horizon, conditioned on the sequence and the concrete spread candidate. Pelican maps that return distribution through the spread payoff, then asks trading questions: What is the median payoff? How bad is the lower tail? Is the expected value still positive after spread cost, slippage, and execution penalties? Is the edge robust enough across model seeds to size the trade?
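A sketch of that mapping before the summary table, assuming an evenly spaced quantile grid and a toy payoff function; the production payoff, cost model, and calibration are not shown.

```python
import numpy as np

# Push each return quantile through the spread payoff, then summarize. With
# evenly spaced quantile levels, the mean payoff approximates expected value.
def decision_stats(levels, return_quantiles, payoff_fn, exec_cost):
    payoffs = np.array([payoff_fn(r) for r in return_quantiles])
    return {
        "median_payoff": float(np.interp(0.5, levels, payoffs)),
        "tail_payoff_p05": float(np.interp(0.05, levels, payoffs)),
        "ev_post_cost": float(payoffs.mean() - exec_cost),
    }

levels = np.linspace(0.05, 0.95, 19)                 # evenly spaced quantile levels
rets = np.linspace(-0.12, 0.12, 19)                  # stand-in return quantiles
toy_payoff = lambda r: 0.92 if r >= 0 else max(-4.08, 0.92 + 80.0 * r)  # toy shape
print(decision_stats(levels, rets, toy_payoff, exec_cost=0.10))
```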

Prediction surface | What it means | How the decision path uses it
Candidate quantiles | A calibrated underlying-return distribution conditioned on a specific spread candidate. | Prices the spread's post-cost EV and tail risk instead of ranking on raw heuristics.
GBM/TFT blend | A CDF mixture of a tree ensemble and the sequence TFT. | Balances tabular feature strength with sequential market-context modeling.
Seed/member robustness | Model-member disagreement and lower-tail edge estimates. | Feeds p20-style sizing so position size is based on robust edge, not only mean edge.
Gate-visible reasons | Why a candidate did or did not survive model, quote, liquidity, risk, duplicate, and timing gates. | Keeps operator surfaces honest when the right answer is to stand down.

In other words, the TFT is one part of the ranking brain. The actual decision is the distribution plus execution reality: fresh quotes, liquidity, strike validity, expiry lane, portfolio exposure, replacement logic, and realized fill/slippage evidence.
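For the GBM/TFT blend, a minimal sketch of the CDF-mixture idea, leaving out the calibration step: invert each model's quantile function into a CDF on a shared grid, average the CDFs, and read blended quantiles back off the mixture. Weights, grids, and tail clamping here are illustrative.

```python
import numpy as np

# CDF-mixture blend of two quantile forecasts (e.g. a GBM ensemble and the
# sequence TFT). Illustrative only: the production blend is calibrated.
def blend_quantiles(levels, q_gbm, q_tft, w_tft=0.5):
    grid = np.linspace(min(q_gbm.min(), q_tft.min()),
                       max(q_gbm.max(), q_tft.max()), 513)
    # Each model's CDF, recovered by inverting its quantile function on the grid.
    cdf_gbm = np.interp(grid, q_gbm, levels, left=0.0, right=1.0)
    cdf_tft = np.interp(grid, q_tft, levels, left=0.0, right=1.0)
    cdf_mix = (1 - w_tft) * cdf_gbm + w_tft * cdf_tft   # mixture of distributions
    return np.interp(levels, cdf_mix, grid)             # blended quantiles

levels = np.array([0.1, 0.5, 0.9])
print(blend_quantiles(levels, np.array([-0.05, 0.00, 0.05]),
                      np.array([-0.08, 0.01, 0.06])))
```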

Labels And One Real AAPL Walkthrough

The label is intentionally simple: label_return = (expiry-aligned close - snapshot close) / snapshot close. It is an underlying-return label, not a direct option-spread P&L label. That separation is important: first learn the calibrated price-return distribution, then apply the actual spread payoff function, quote, cost, and execution gates.
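As code, with made-up prices rather than the stored AAPL row:

```python
# The label definition above, as code. Prices are illustrative.
def label_return(snapshot_close: float, expiry_aligned_close: float) -> float:
    return (expiry_aligned_close - snapshot_close) / snapshot_close

print(label_return(100.0, 98.0))   # a 2% underlying decline -> -0.02
```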

Historical label example

AAPL, 2026-02-11 21:00 UTC snapshot, 2026-03-13 expiry, bear_call, 280/285 strikes, 5-point width. The stored label_return is -9.42%. For that structure, the underlying moving down is directionally favorable, but the label itself is still the underlying move. Payoff is computed after the distribution is mapped through the spread.

Sequence contract

The same row links to encoder sequence seq-c9ea29edf44ea084ca9e: 28 observed steps, full coverage, ending at 2026-02-11 20:30 UTC. The candidate row carries the 30-DTE contract, strikes, width, credit, moneyness, quote age, and funnel metadata.

Live-style decision path: AAPL bull put

A real production decision artifact from 2026-05-14 15:03 ET scored an AAPL bull_put candidate: short 290, long 285, expiring 2026-06-05, with AAPL at 298.52. The executable credit was 0.92 on a 5-point-wide spread, so max profit was 0.92 and max loss was 4.08.
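The stored numbers follow from the standard bull-put payoff at expiry. A quick check using the example's strikes and credit; this is the textbook payoff, not Pelican's internal pricing code:

```python
# Bull-put payoff at expiry for the example above (short 290 / long 285,
# credit 0.92). Reproduces the max-profit and max-loss figures.
def bull_put_payoff(spot_at_expiry, short_k=290.0, long_k=285.0, credit=0.92):
    short_put = max(short_k - spot_at_expiry, 0.0)   # obligation sold
    long_put = max(long_k - spot_at_expiry, 0.0)     # protection bought
    return credit - short_put + long_put

print(bull_put_payoff(298.52))  # spot unchanged: +0.92 max profit
print(bull_put_payoff(289.08))  # breakeven: short strike minus credit -> 0.0
print(bull_put_payoff(280.00))  # below both strikes: -4.08 max loss
```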

Decision field | Value | Meaning
Horizon | 22 DTE | The model scored the return distribution to the spread's expiry horizon.
Encoder quality | 28 observed steps, sequence score pass | The sequence TFT had the expected recent market context available.
Blended fair credit | 0.64 vs market credit 0.92 | The model viewed the market credit as richer than fair value, creating edge.
Profit probability | 85.8% | The payoff map estimated a high probability of finishing above breakeven.
Payoff distribution | p05 -4.08, p50 0.92, p95 0.92 | The trade is defined-risk: the lower tail loses the spread max loss; upper outcomes cap at the credit.
Ranking edge | 6.87% | Positive model edge before the final execution and sizing reality checks.
Post-cost EV | 0.188 per contract after execution penalty | The candidate still had positive expected value after the learned execution penalty.
Robust sizing edge | -15.74% p20 seed edge | The lower seed/member view was not robust enough to size aggressively.
Final gate | Dropped: width 1.67% of spot < 2.00% minimum | The system did not submit it. A forecast is not a trade; gates still decide executability.

This is the operational point of the architecture. The TFT helps estimate the distribution, but the trading system prices the actual spread, checks robustness, charges execution cost, enforces width/liquidity/risk rules, and preserves a clear reason when the correct action is no trade.

The Architecture Pieces

TFT is useful because each module has a job. It keeps time availability explicit, lets feature relevance change by regime, uses recurrence for local order, uses attention for retrieval across the lookback window, and emits a distribution instead of a single point estimate.

Module | Role | Pelican interpretation
Typed inputs | Separate static, past-observed, and future-known covariates. | Keep live-available market state separate from label-only future outcomes.
Variable selection | Learn which features matter for this entity, time, and horizon. | Let option-surface, news, calendar, and quote features matter differently by ticker and regime.
Static context | Condition the whole network on entity identity. | Let AAPL, SPY, NVDA, and IWM use shared structure without pretending their signals mean exactly the same thing.
LSTM path | Model local sequence order, recency, and short-run dynamics. | Read the 28-step market sequence before scoring the current spread candidate.
Gated residual networks | Apply nonlinear transforms only when they help. | Reduce brittleness when some feature blocks are stale, noisy, or low-signal.
Attention | Retrieve the historical time steps most relevant to the forecast horizon. | Expose useful diagnostics about which recent market windows influenced the score.
Quantile heads | Train with pinball loss to produce calibrated lower, median, and upper outcomes. | Turn one forecast into fair value, lower-tail payoff, profit probability, and seed-robust sizing inputs.
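To make the gating idea concrete, here is a minimal gated-residual-network sketch in the spirit of the TFT paper, in PyTorch with illustrative dimensions; the paper's variant also accepts an optional static-context input, omitted here.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Minimal GRN: a nonlinear transform whose output passes through a learned
# gate (GLU) and is added back to the input via a residual connection with
# layer norm, so the block can suppress itself when the transform does not help.
class GRN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, 2 * d_model)  # doubled for the GLU split
        self.glu = nn.GLU(dim=-1)                    # learned gate
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc2(F.elu(self.fc1(x)))
        return self.norm(x + self.glu(h))            # skip path survives gating

grn = GRN(d_model=16, d_hidden=32)
print(grn(torch.randn(4, 16)).shape)   # torch.Size([4, 16])
```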

Why TFT Works

The strongest reason TFT works is constraint. It is expressive, but it is not shapeless. It forces the model to respect when information is known, where entity context belongs, how recent sequence dynamics enter, and why uncertainty needs to be forecast directly.

Time availability is explicit

Past-observed, future-known, and static features do not share one undisciplined input pipe. That is the first guard against leakage.

Feature relevance can move

IV rank, event distance, put-call ratios, and news state should not have fixed importance across all tickers and market regimes.

Local order and longer memory both matter

The LSTM path handles recency and ordering; attention lets the model retrieve older points in the lookback window when they matter.

Distribution beats point forecast

Defined-risk options need lower tails, medians, fair values, and uncertainty width. A single expected return is not enough.

Implementation And Failure Modes

A strong TFT implementation is mostly a data-contract problem. The model can be clever only after feature availability, labels, splits, calibration, and production parity are enforced.

Data contract

  • Mark every feature as static, past-observed, or future-known.
  • Guarantee no future target leakage into observed covariates.
  • Version the lookback length, horizon contract, feature schema, and scaler state.
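A sketch of how that typing can be enforced at the schema level, with illustrative names:

```python
from enum import Enum

# Feature-typing contract: enforcing availability at schema level is the
# first leakage guard. Names and the check are illustrative assumptions.
class Availability(Enum):
    STATIC = "static"            # constant per entity
    PAST_OBSERVED = "past"       # known only up to the decision snapshot
    FUTURE_KNOWN = "future"      # known by contract over the horizon

SCHEMA = {
    "ticker": Availability.STATIC,
    "atm_iv": Availability.PAST_OBSERVED,
    "days_to_fomc": Availability.FUTURE_KNOWN,
}

def check_no_leakage(feature: str, used_in_future_path: bool) -> None:
    # Anything observed only in the past may not enter the decoder's
    # future-known inputs.
    if used_in_future_path and SCHEMA[feature] is Availability.PAST_OBSERVED:
        raise ValueError(f"{feature} leaks: past-observed used as future input")

check_no_leakage("days_to_fomc", used_in_future_path=True)   # ok: known by contract
# check_no_leakage("atm_iv", used_in_future_path=True) would raise
```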

Training

  • Normalize continuous variables consistently, often per entity or group.
  • Use quantile loss for the required output quantiles.
  • Evaluate by horizon, not only averaged across all horizons.
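The quantile (pinball) loss itself is small enough to state directly; a sketch for one quantile level q, summed over the required levels during training:

```python
import numpy as np

# Pinball loss for quantile level q: under-prediction is penalized with
# weight q, over-prediction with weight (1 - q), so the minimizer is the
# q-th conditional quantile.
def pinball_loss(y_true, y_pred, q):
    err = y_true - y_pred
    return float(np.mean(np.maximum(q * err, (q - 1.0) * err)))

y = np.array([0.010, -0.030, 0.020])
pred = np.zeros_like(y)                         # stand-in quantile-head outputs
total = sum(pinball_loss(y, pred, q) for q in (0.1, 0.5, 0.9))
print(total)   # training sums the loss across the required quantile heads
```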

Validation

  • Compare against seasonal naive, linear, GBM, and simpler neural baselines.
  • Run feature ablations and entity-slice diagnostics.
  • Backtest the downstream decision rule, not just forecast error.

Production

  • Version the feature schema, scalers, embeddings, and horizon contract.
  • Fail closed on missing required future-known covariates.
  • Monitor freshness, drift, calibration, and per-horizon error.
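A sketch of the fail-closed idea, with assumed field names and thresholds: if a required input is missing or stale, refuse to score rather than impute silently.

```python
# Fail-closed inference guard. REQUIRED fields and the staleness threshold
# are illustrative assumptions, not the production contract.
REQUIRED_FUTURE_KNOWN = ("days_to_earnings", "days_to_fomc", "candidate_dte")

def assert_scoreable(row: dict, max_quote_age_s: float = 30.0) -> None:
    missing = [k for k in REQUIRED_FUTURE_KNOWN if row.get(k) is None]
    if missing:
        raise RuntimeError(f"fail closed: missing future-known inputs {missing}")
    if row.get("candidate_quote_age_seconds", float("inf")) > max_quote_age_s:
        raise RuntimeError("fail closed: candidate quote too stale to score")

assert_scoreable({"days_to_earnings": 12, "days_to_fomc": 9,
                  "candidate_dte": 22, "candidate_quote_age_seconds": 4.0})
```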
Failure mode | Symptom | Guardrail
Leakage | Validation looks excellent but live scoring collapses. | Feature-availability typing plus no-leakage proofs.
Miscalibration | p10/p90 intervals are too narrow or too wide out of sample. | Replay-derived PIT checks, calibration windows, and post-model calibration layers.
Operational mismatch | Training features exist offline but are stale, delayed, or absent live. | Freshness/missingness features, live parity checks, and fail-closed inference contracts.
False interpretability | Variable weights or attention are treated as causal proof. | Use them as diagnostics only, then validate with ablations, holdouts, and decision backtests.