Forecasting Architecture

Temporal Fusion Transformer

A Temporal Fusion Transformer, usually shortened to TFT, is a neural architecture for multi-horizon time-series forecasting. It is built for the messy forecasting cases that show up in production: multiple related entities, many covariates, known future inputs like calendars or prices, static metadata like store or ticker identity, nonlinear interactions, regime changes, and the need to explain which inputs mattered.

The Short Version

TFT is a supervised deep-learning model that takes a lookback window and produces forecasts for several future horizons. Its main trick is not just "use attention." Its main trick is architectural discipline: separate the information into static features, past-observed time-varying features, and future-known time-varying features, then let each kind of information enter the model through the path where it is most useful.

In plain English: TFT tries to answer three questions at once. What kind of entity is this? What has been happening recently? What do we already know about the future? It then decides which inputs matter, models short-term sequential behavior, uses attention to choose the important historical time points, and emits a full uncertainty-aware forecast.
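To make the three-way split concrete, here is a minimal sketch with illustrative shapes and names, not any particular library's API:

```python
import numpy as np

# Minimal sketch of the three input types a TFT consumes. Shapes and names
# are illustrative assumptions, not from any specific implementation.
lookback, horizon = 28, 5          # encoder steps, forecast steps
n_past, n_future, n_static = 12, 4, 3

static_feats = np.zeros(n_static)              # entity metadata, constant over time
past_feats   = np.zeros((lookback, n_past))    # observed only up to the snapshot
future_feats = np.zeros((horizon, n_future))   # known ahead of time (calendar, plans)

# The availability rule the architecture enforces: past_feats may include the
# target history and realized covariates; future_feats may only include
# information known by contract at decision time.
forecast_quantiles = np.zeros((horizon, 3))    # e.g. p10 / p50 / p90 per horizon
```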

The architecture was introduced in "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" by Lim et al. It became influential because it combined several useful ideas that had often been handled separately: covariate-aware forecasting, local recurrent sequence modeling, transformer-style temporal attention, static conditioning, feature selection, gating, residual learning, and quantile loss.

Drawn Architecture

This diagram shows the high-level data flow. The exact tensor shapes vary by implementation, but the conceptual paths are stable: static metadata conditions the network; past and future covariates pass through variable selection; LSTMs encode local sequence behavior; attention decides which encoded time points matter for each forecast horizon; the output head emits quantiles.

Figure: Temporal Fusion Transformer, from raw covariates to multi-horizon quantiles. Five stages run left to right: inputs, variable selection, sequential modeling, attention, and forecast output. Static inputs (entity id, sector, geography, store type, ticker metadata) feed a static covariate encoder that creates context vectors for variable selection, LSTM state initialization, enrichment, and gating. Past observed inputs (target history y[t-k:t], observed covariates, lagged signals, realized state) and future known inputs (calendar, price plans, events, scheduled actions, forecast-horizon markers) each pass through their own variable selection network. An LSTM encoder reads the selected past sequence to capture local order, recency, and short-run dynamics; an LSTM decoder rolls through future-known covariates for horizons t+1 ... t+H. Gated residual networks skip, suppress, or refine nonlinear transforms, and static enrichment injects entity context into each temporal representation before interpretable multi-head attention decides which historical time points matter for each horizon. Quantile heads emit p10, p50, and p90 forecasts for every horizon, producing the forecast distribution y_hat[t+1:t+H, q]. Solid arrows are main data flow; dashed arrows are conditioning and gating paths from static context; attention weights are inspectable across time.

How We Use It In Pelican

In Pelican, the TFT is not a generic stock-price oracle. It is a sequence model used inside a defined-risk options decision system. The active production family is tft_classical_sequence_v1. Its job is to rescore concrete option-spread candidates after the system has built current market features, generated candidate spreads, and checked the candidate's own quote and structure details.

The important design choice is candidate-level forecasting. A row is not just "AAPL at 10:00." It is closer to "this ticker, this timestamp, this bull-put or bear-call structure, this expiry, these strikes, this width, this quote, this liquidity, and this historical context." That makes the forecast directly usable for ranking and sizing trades.
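As a sketch of what one such row carries, with field names echoing the features described later but otherwise hypothetical:

```python
from dataclasses import dataclass

# Illustrative shape of one candidate-level training/inference row. Field
# names follow the features described in this article but are hypothetical here.
@dataclass
class CandidateRow:
    ticker: str            # "AAPL"
    snapshot_ts: str       # decision-time snapshot, e.g. "2026-02-11T21:00Z"
    structure: str         # "bull_put" or "bear_call"
    expiry: str            # option expiration date
    candidate_dte: int     # days to expiry at decision time
    short_strike: float
    long_strike: float
    width: float           # abs(short_strike - long_strike)
    entry_credit_mid: float
    spread_bid_ask_width: float
    candidate_quote_age_seconds: float
    sequence_id: str       # link to the 28-step encoder sequence
```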

1. Snapshot: market state arrives

Price, option chains, volume, open interest, implied distributions, calendar state, and news-derived state.

2. Sequence: 28 observed steps

The encoder sees a recent sequence of market snapshots, including freshness and missingness signals.

3. Candidate: spread is specified

Structure, expiry, strikes, width, credit, Greeks, moneyness, bid-ask, quote age, OI, and volume.

4. Forecast: TFT + GBM distribution

The TFT forecast is blended with a GBM quantile ensemble through a calibrated CDF mixture.

5. Decision: edge after costs

The distribution turns into candidate EV, tail risk, p20 sizing edge, rank, and gate-visible no-trade reasons.

The model does not authorize trades by itself. It feeds a post-cost decision path: candidate generation, calibrated distribution pricing, liquidity checks, duplicate checks, risk limits, broker-safe execution gates, and operator-visible reasons when no candidate is eligible.
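A toy, end-to-end rendering of those five steps, where every name and all the placeholder logic are illustrative assumptions rather than the production code:

```python
# Toy version of the five-step path above. All names and values are made up.
def build_sequence(history, steps=28):
    return history[-steps:]                           # step 2: recent snapshots

def forecast(sequence, candidate):
    return {"p10": -0.05, "p50": 0.01, "p90": 0.06}   # step 4: stand-in quantiles

def decide(candidate, dist, min_edge=0.0):
    edge = dist["p50"] - candidate["cost"]            # step 5: toy post-cost edge
    return ("trade", edge) if edge > min_edge else ("no_trade", "edge<=min")

history = [{"close": 100 + i} for i in range(40)]     # step 1: market snapshots
candidate = {"structure": "bull_put", "cost": 0.004}  # step 3: spread candidate
seq = build_sequence(history)
print(decide(candidate, forecast(seq, candidate)))
```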

What The Training Data Looks Like

The current sequence-TFT training store is deliberately structured around the live decision contract. It has a sequence side, a candidate side, and an expiry-aligned underlying-return label for the candidate horizon. The no-leakage proof for the current store passes, which matters more than model cleverness in this kind of system.

  • 3,105 encoder sequences
  • 86,605 sequence time steps
  • 4,969 candidate rows and training rows
  • 28 observed steps per encoder sequence

Sequence features

The encoder sees market and options state through time: implied-distribution percentiles, IV rank, put-call ratios, open-interest and volume structure, spread quality, realized-volatility context, cross-sectional ranks, macro-calendar distances, day-of-week features, earnings flags, freshness indicators, missingness indicators, and a 64-dimensional news state.

Candidate features

Each candidate row describes the actual spread being considered: bull put or bear call, expiration, candidate DTE, short and long strikes, width, bid/mid/ask credit, bid-ask width, leg Greeks, IV, moneyness, distance to strikes, leg open interest, leg volume, quote age, and guard/funnel metadata.

Labels

The target is label_return: the underlying return from the decision snapshot to the candidate's expiry-aligned close. The spread candidate supplies the horizon, structure, strikes, width, and quote context used to turn a return distribution into payoff.

Leakage control

Features are split by what is observed at decision time versus what is known by contract, and the candidate store carries a no-leakage proof. If a feature is not available on the live path, it should not quietly enter training.

Operational Features We Actually Feed

The model input is intentionally unglamorous. Most of the value comes from forcing the same live decision facts into training and inference: what the option surface looked like, what exact spread was executable, how fresh the data was, which events were nearby, and whether the sequence had holes. Feature names matter because every name is a production contract.

Feature family | Concrete examples | Why it matters
Option surface and chain state | atm_iv, atm_iv_xrank, chain_impl_ret_p10, chain_impl_ret_p50, chain_impl_width, chain_pc_oi_ratio, chain_total_oi, chain_median_spread_pct | Shows what the options market is implying about distribution shape, liquidity, crowding, and transaction quality.
Candidate contract | structure, candidate_dte, short_strike, long_strike, width, entry_credit_mid, spread_bid_ask_width, net_delta, short_leg_oi, long_leg_volume | Turns the forecast from a generic ticker view into a decision about one executable bull-put or bear-call spread.
Event and calendar state | days_to_earnings, days_to_fomc, days_to_cpi, days_to_nfp, days_to_monthly_opex, dow_sin | Lets the model separate ordinary drift from known event-risk windows and recurring market-calendar effects.
News and text state | news_state_00 through news_state_63, plus news-state freshness fields | Condenses recent text context into stable numeric state without letting the model read future news.
Freshness and missingness | row_observed_mask, feature_observed_mask, feature_age_minutes, source_age_minutes, option_surface_age_minutes, candidate_quote_age_seconds | Prevents stale or absent data from looking like a confident numeric signal. Missingness is itself information.
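A minimal sketch of that last idea, assuming illustrative names: each raw value travels with an observed mask and an age, so absence and staleness stay visible to the model instead of silently becoming numbers.

```python
# Sketch of the freshness/missingness encoding. Names and the neutral-fill
# convention are illustrative assumptions.
def encode_feature(value, age_minutes, max_age=120.0):
    observed = value is not None
    return {
        "value": value if observed else 0.0,          # neutral fill, not a guess
        "observed_mask": 1.0 if observed else 0.0,    # missingness is a feature
        "age_frac": min(age_minutes / max_age, 1.0),  # staleness is a feature too
    }

print(encode_feature(0.2948, age_minutes=3.0))   # fresh atm_iv reading
print(encode_feature(None, age_minutes=45.0))    # missing reading, visibly missing
```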

Concrete AAPL sequence slice

One historical AAPL training example used a full 28-step encoder sequence ending near the 2026-02-11 close. The sequence covered snapshots from 2026-02-05 20:30 UTC through 2026-02-11 20:30 UTC, with no missing steps. The table below shows how a few real inputs changed from the first to the last observed step.

Feature | First observed step | Last observed step | Interpretation
atm_iv | 0.2948 | 0.2477 | At-the-money implied volatility compressed across the lookback window.
chain_impl_ret_p10 | -2.09% | -3.36% | The option-implied lower-return tail became more negative.
chain_pc_oi_ratio | 0.458 | 0.619 | Put-call open-interest balance moved toward more put weight.
chain_total_oi | 551,792 | 212,969 | Available open-interest state changed materially, so liquidity context changed too.
chain_median_spread_pct | 4.19% | 6.03% | The chain became more expensive to trade on median spread quality.
news_state_00 | 0.286 | 0.180 | The compressed news/text state changed, giving the sequence path non-price context.

What Predictions We Make

The useful output is a forecast distribution for the underlying return over the candidate horizon, conditioned on the sequence and the concrete spread candidate. Pelican maps that return distribution through the spread payoff, then asks trading questions: What is the median payoff? How bad is the lower tail? Is the expected value still positive after spread cost, slippage, and execution penalties? Is the edge robust enough across model seeds to size the trade?
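A sketch of that mapping before the summary table, assuming an evenly spaced quantile grid and a toy payoff function; the production payoff, cost model, and calibration are not shown.

```python
import numpy as np

# Push each return quantile through the spread payoff, then summarize. With
# evenly spaced quantile levels, the mean payoff approximates expected value.
def decision_stats(levels, return_quantiles, payoff_fn, exec_cost):
    payoffs = np.array([payoff_fn(r) for r in return_quantiles])
    return {
        "median_payoff": float(np.interp(0.5, levels, payoffs)),
        "tail_payoff_p05": float(np.interp(0.05, levels, payoffs)),
        "ev_post_cost": float(payoffs.mean() - exec_cost),
    }

levels = np.linspace(0.05, 0.95, 19)                 # evenly spaced quantile levels
rets = np.linspace(-0.12, 0.12, 19)                  # stand-in return quantiles
toy_payoff = lambda r: 0.92 if r >= 0 else max(-4.08, 0.92 + 80.0 * r)  # toy shape
print(decision_stats(levels, rets, toy_payoff, exec_cost=0.10))
```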

Prediction surface | What it means | How the decision path uses it
Candidate quantiles | A calibrated underlying-return distribution conditioned on a specific spread candidate. | Prices the spread's post-cost EV and tail risk instead of ranking on raw heuristics.
GBM/TFT blend | A CDF mixture of a tree ensemble and the sequence TFT. | Balances tabular feature strength with sequential market-context modeling.
Seed/member robustness | Model-member disagreement and lower-tail edge estimates. | Feeds p20-style sizing so position size is based on robust edge, not only mean edge.
Gate-visible reasons | Why a candidate did or did not survive model, quote, liquidity, risk, duplicate, and timing gates. | Keeps operator surfaces honest when the right answer is to stand down.

In other words, the TFT is one part of the ranking brain. The actual decision is the distribution plus execution reality: fresh quotes, liquidity, strike validity, expiry lane, portfolio exposure, replacement logic, and realized fill/slippage evidence.
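For the GBM/TFT blend, a minimal sketch of the CDF-mixture idea, leaving out the calibration step: invert each model's quantile function into a CDF on a shared grid, average the CDFs, and read blended quantiles back off the mixture. Weights, grids, and tail clamping here are illustrative.

```python
import numpy as np

# CDF-mixture blend of two quantile forecasts (e.g. a GBM ensemble and the
# sequence TFT). Illustrative only: the production blend is calibrated.
def blend_quantiles(levels, q_gbm, q_tft, w_tft=0.5):
    grid = np.linspace(min(q_gbm.min(), q_tft.min()),
                       max(q_gbm.max(), q_tft.max()), 513)
    # Each model's CDF, recovered by inverting its quantile function on the grid.
    cdf_gbm = np.interp(grid, q_gbm, levels, left=0.0, right=1.0)
    cdf_tft = np.interp(grid, q_tft, levels, left=0.0, right=1.0)
    cdf_mix = (1 - w_tft) * cdf_gbm + w_tft * cdf_tft   # mixture of distributions
    return np.interp(levels, cdf_mix, grid)             # blended quantiles

levels = np.array([0.1, 0.5, 0.9])
print(blend_quantiles(levels, np.array([-0.05, 0.00, 0.05]),
                      np.array([-0.08, 0.01, 0.06])))
```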

Labels And One Real AAPL Walkthrough

The label is intentionally simple: label_return = (expiry-aligned close - snapshot close) / snapshot close. It is an underlying-return label, not a direct option-spread P&L label. That separation is important: first learn the calibrated price-return distribution, then apply the actual spread payoff function, quote, cost, and execution gates.
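As code, with made-up prices rather than the stored AAPL row:

```python
# The label definition above, as code. Prices are illustrative.
def label_return(snapshot_close: float, expiry_aligned_close: float) -> float:
    return (expiry_aligned_close - snapshot_close) / snapshot_close

print(label_return(100.0, 98.0))   # a 2% underlying decline -> -0.02
```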

Historical label example

AAPL, 2026-02-11 21:00 UTC snapshot, 2026-03-13 expiry, bear_call, 280/285 strikes, 5-point width. The stored label_return is -9.42%. For that structure, the underlying moving down is directionally favorable, but the label itself is still the underlying move. Payoff is computed after the distribution is mapped through the spread.

Sequence contract

The same row links to encoder sequence seq-c9ea29edf44ea084ca9e: 28 observed steps, full coverage, ending at 2026-02-11 20:30 UTC. The candidate row carries the 30-DTE contract, strikes, width, credit, moneyness, quote age, and funnel metadata.

Live-style decision path: AAPL bull put

A real production decision artifact from 2026-05-14 15:03 ET scored an AAPL bull_put candidate: short 290, long 285, expiring 2026-06-05, with AAPL at 298.52. The executable credit was 0.92 on a 5-point-wide spread, so max profit was 0.92 and max loss was 4.08.
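The stored numbers follow from the standard bull-put payoff at expiry. A quick check using the example's strikes and credit; this is the textbook payoff, not Pelican's internal pricing code:

```python
# Bull-put payoff at expiry for the example above (short 290 / long 285,
# credit 0.92). Reproduces the max-profit and max-loss figures.
def bull_put_payoff(spot_at_expiry, short_k=290.0, long_k=285.0, credit=0.92):
    short_put = max(short_k - spot_at_expiry, 0.0)   # obligation sold
    long_put = max(long_k - spot_at_expiry, 0.0)     # protection bought
    return credit - short_put + long_put

print(bull_put_payoff(298.52))  # spot unchanged: +0.92 max profit
print(bull_put_payoff(289.08))  # breakeven: short strike minus credit -> 0.0
print(bull_put_payoff(280.00))  # below both strikes: -4.08 max loss
```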

Decision field | Value | Meaning
Horizon | 22 DTE | The model scored the return distribution to the spread's expiry horizon.
Encoder quality | 28 observed steps, sequence score pass | The sequence TFT had the expected recent market context available.
Blended fair credit | 0.64 vs market credit 0.92 | The model viewed the market credit as richer than fair value, creating edge.
Profit probability | 85.8% | The payoff map estimated a high probability of finishing above breakeven.
Payoff distribution | p05 -4.08, p50 0.92, p95 0.92 | The trade is defined-risk: the lower tail loses the spread max loss; upper outcomes cap at the credit.
Ranking edge | 6.87% | Positive model edge before the final execution and sizing reality checks.
Post-cost EV | 0.188 per contract after execution penalty | The candidate still had positive expected value after the learned execution penalty.
Robust sizing edge | -15.74% p20 seed edge | The lower seed/member view was not robust enough to size aggressively.
Final gate | Dropped: width 1.67% of spot < 2.00% minimum | The system did not submit it. A forecast is not a trade; gates still decide executability.

This is the operational point of the architecture. The TFT helps estimate the distribution, but the trading system prices the actual spread, checks robustness, charges execution cost, enforces width/liquidity/risk rules, and preserves a clear reason when the correct action is no trade.

The Architecture Pieces

TFT is useful because each module has a job. It keeps time availability explicit, lets feature relevance change by regime, uses recurrence for local order, uses attention for retrieval across the lookback window, and emits a distribution instead of a single point estimate.

Module | Role | Pelican interpretation
Typed inputs | Separate static, past-observed, and future-known covariates. | Keep live-available market state separate from label-only future outcomes.
Variable selection | Learn which features matter for this entity, time, and horizon. | Let option-surface, news, calendar, and quote features matter differently by ticker and regime.
Static context | Condition the whole network on entity identity. | Let AAPL, SPY, NVDA, and IWM use shared structure without pretending their signals mean exactly the same thing.
LSTM path | Model local sequence order, recency, and short-run dynamics. | Read the 28-step market sequence before scoring the current spread candidate.
Gated residual networks | Apply nonlinear transforms only when they help. | Reduce brittleness when some feature blocks are stale, noisy, or low-signal.
Attention | Retrieve the historical time steps most relevant to the forecast horizon. | Expose useful diagnostics about which recent market windows influenced the score.
Quantile heads | Train with pinball loss to produce calibrated lower, median, and upper outcomes. | Turn one forecast into fair value, lower-tail payoff, profit probability, and seed-robust sizing inputs.
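To make the gating idea concrete, here is a minimal gated-residual-network sketch in the spirit of the TFT paper, in PyTorch with illustrative dimensions; the paper's variant also accepts an optional static-context input, omitted here.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Minimal GRN: a nonlinear transform whose output passes through a learned
# gate (GLU) and is added back to the input via a residual connection with
# layer norm, so the block can suppress itself when the transform does not help.
class GRN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, 2 * d_model)  # doubled for the GLU split
        self.glu = nn.GLU(dim=-1)                    # learned gate
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc2(F.elu(self.fc1(x)))
        return self.norm(x + self.glu(h))            # skip path survives gating

grn = GRN(d_model=16, d_hidden=32)
print(grn(torch.randn(4, 16)).shape)   # torch.Size([4, 16])
```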

Why TFT Works

The strongest reason TFT works is constraint. It is expressive, but it is not shapeless. It forces the model to respect when information is known, where entity context belongs, how recent sequence dynamics enter, and why uncertainty needs to be forecast directly.

Time availability is explicit

Past-observed, future-known, and static features do not share one undisciplined input pipe. That is the first guard against leakage.

Feature relevance can move

IV rank, event distance, put-call ratios, and news state should not have fixed importance across all tickers and market regimes.

Local order and longer memory both matter

The LSTM path handles recency and ordering; attention lets the model retrieve older points in the lookback window when they matter.

Distribution beats point forecast

Defined-risk options need lower tails, medians, fair values, and uncertainty width. A single expected return is not enough.

Implementation And Failure Modes

A strong TFT implementation is mostly a data-contract problem. The model can be clever only after feature availability, labels, splits, calibration, and production parity are enforced.

Data contract

  • Mark every feature as static, past-observed, or future-known.
  • Guarantee no future target leakage into observed covariates.
  • Version the lookback length, horizon contract, feature schema, and scaler state.
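A sketch of how that typing can be enforced at the schema level, with illustrative names:

```python
from enum import Enum

# Feature-typing contract: enforcing availability at schema level is the
# first leakage guard. Names and the check are illustrative assumptions.
class Availability(Enum):
    STATIC = "static"            # constant per entity
    PAST_OBSERVED = "past"       # known only up to the decision snapshot
    FUTURE_KNOWN = "future"      # known by contract over the horizon

SCHEMA = {
    "ticker": Availability.STATIC,
    "atm_iv": Availability.PAST_OBSERVED,
    "days_to_fomc": Availability.FUTURE_KNOWN,
}

def check_no_leakage(feature: str, used_in_future_path: bool) -> None:
    # Anything observed only in the past may not enter the decoder's
    # future-known inputs.
    if used_in_future_path and SCHEMA[feature] is Availability.PAST_OBSERVED:
        raise ValueError(f"{feature} leaks: past-observed used as future input")

check_no_leakage("days_to_fomc", used_in_future_path=True)   # ok: known by contract
# check_no_leakage("atm_iv", used_in_future_path=True) would raise
```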

Training

  • Normalize continuous variables consistently, often per entity or group.
  • Use quantile loss for the required output quantiles.
  • Evaluate by horizon, not only averaged across all horizons.
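The quantile (pinball) loss itself is small enough to state directly; a sketch for one quantile level q, summed over the required levels during training:

```python
import numpy as np

# Pinball loss for quantile level q: under-prediction is penalized with
# weight q, over-prediction with weight (1 - q), so the minimizer is the
# q-th conditional quantile.
def pinball_loss(y_true, y_pred, q):
    err = y_true - y_pred
    return float(np.mean(np.maximum(q * err, (q - 1.0) * err)))

y = np.array([0.010, -0.030, 0.020])
pred = np.zeros_like(y)                         # stand-in quantile-head outputs
total = sum(pinball_loss(y, pred, q) for q in (0.1, 0.5, 0.9))
print(total)   # training sums the loss across the required quantile heads
```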

Validation

  • Compare against seasonal naive, linear, GBM, and simpler neural baselines.
  • Run feature ablations and entity-slice diagnostics.
  • Backtest the downstream decision rule, not just forecast error.

Production

  • Version the feature schema, scalers, embeddings, and horizon contract.
  • Fail closed on missing required future-known covariates.
  • Monitor freshness, drift, calibration, and per-horizon error.
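A sketch of the fail-closed idea, with assumed field names and thresholds: if a required input is missing or stale, refuse to score rather than impute silently.

```python
# Fail-closed inference guard. REQUIRED fields and the staleness threshold
# are illustrative assumptions, not the production contract.
REQUIRED_FUTURE_KNOWN = ("days_to_earnings", "days_to_fomc", "candidate_dte")

def assert_scoreable(row: dict, max_quote_age_s: float = 30.0) -> None:
    missing = [k for k in REQUIRED_FUTURE_KNOWN if row.get(k) is None]
    if missing:
        raise RuntimeError(f"fail closed: missing future-known inputs {missing}")
    if row.get("candidate_quote_age_seconds", float("inf")) > max_quote_age_s:
        raise RuntimeError("fail closed: candidate quote too stale to score")

assert_scoreable({"days_to_earnings": 12, "days_to_fomc": 9,
                  "candidate_dte": 22, "candidate_quote_age_seconds": 4.0})
```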
Failure mode | Symptom | Guardrail
Leakage | Validation looks excellent but live scoring collapses. | Feature-availability typing plus no-leakage proofs.
Miscalibration | p10/p90 intervals are too narrow or too wide out of sample. | Replay-derived PIT checks, calibration windows, and post-model calibration layers.
Operational mismatch | Training features exist offline but are stale, delayed, or absent live. | Freshness/missingness features, live parity checks, and fail-closed inference contracts.
False interpretability | Variable weights or attention are treated as causal proof. | Use them as diagnostics only, then validate with ablations, holdouts, and decision backtests.