Convergent Evolution:
How Different Language Models Learn Similar Number Representations

University of Southern California  ·  UC San Diego
📡
Do pretrained LLMs develop Fourier spikes at T = 2, 5, 10?
Universal
9 pretrained LLMs, classical word embeddings, and even the raw number-token frequency distribution all spike at T = 2, 5, 10.
🧭
Do they all encode n mod T linearly?
3 of 4
Among four 300M-param architectures we trained on 10B FineWeb-Edu tokens, the Transformer, Gated DeltaNet, and Mamba-2 develop linearly separable mod-T classes; the LSTM stays at chance, despite a larger Fourier spike.
📐
Why can spectrum and geometry come apart?
Theorem 1
Fourier sparsity is necessary but not sufficient for mod-T linear separability; the within-class scatter controls the gap.
🛤️
How does geometric convergence emerge?
Two routes
(1) complementary text co-occurrence signals in pretraining; (2) multi-token addition, where per-digit carries turn each output position into a modular subproblem.
Fourier spectrum of number embeddings across diverse model families
Universality of Fourier features. Across Transformer LLMs (GPT-2, GPT-OSS, Llama-3, Llama-4, DeepSeek-V3), non-Transformer LLMs (Mamba, Falcon-Mamba, xLSTM, Kimi-Linear), and classical word embeddings (GloVe, FastText), number token embeddings consistently show spikes at periods T = 2, 5, 10 in the Fourier domain.

Abstract

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at T = 2, 5, 10. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-T spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-T.

To explain this incongruity, we prove that Fourier-domain sparsity is necessary but not sufficient for mod-T geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: from complementary co-occurrence signals in general language data (including text–number co-occurrence and cross-number interaction), or from multi-token (but not single-token) addition problems.

Overall, our results highlight the phenomenon of convergent evolution in feature learning: a diverse range of models learn similar features from different training signals.

The key dissociation

We separate two types of convergence that prior work has often conflated. One is easy to get; the other is hard-won.

Universal

Spectral convergence

Fourier spikes at T = 2, 5, 10. Every pretrained LLM we checked has them, and so does the raw number-token frequency distribution with no model at all. Spectral convergence reflects the statistics of the training data, not learned structure.
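To make "spectral convergence" concrete, here is a minimal sketch of how one could measure it: FFT each embedding dimension along the number index and average the power. The embedding matrix below is a synthetic stand-in with weak period-2, 5, 10 components injected (real analyses would use a model's actual number-token embeddings).

```python
import numpy as np

def number_embedding_spectrum(E):
    """Average Fourier power spectrum of number-token embeddings.

    E: (N, d) matrix, row n = embedding of the token for integer n.
    Returns power averaged over dimensions, for frequency bins k = 0..N//2.
    """
    E = E - E.mean(axis=0)                # remove the per-dimension DC component
    F = np.fft.rfft(E, axis=0)            # FFT along the number index
    return (np.abs(F) ** 2).mean(axis=1)  # average power across dimensions

# Synthetic stand-in: random features plus weak periodic components at
# T = 2, 5, 10 (the periods pretrained models are observed to exhibit).
rng = np.random.default_rng(0)
N, d = 1000, 64
n = np.arange(N)[:, None]
E = rng.normal(size=(N, d))
for T in (2, 5, 10):
    E += np.cos(2 * np.pi * n / T + rng.uniform(0, 2 * np.pi, size=d))

power = number_embedding_spectrum(E)
# A period-T component shows up at frequency bin k = N / T.
for T in (2, 5, 10):
    print(T, power[N // T] > 10 * np.median(power))
```

The period-T spike lands at bin k = N/T because a component cos(2πn/T) completes N/T cycles over n = 0..N-1.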

Selective

Geometric convergence

Residue classes n mod T are linearly separable in the embedding. This requires the data, architecture, and optimizer to align. Under our 300M-parameter, 10B-token setup, a Transformer reaches Cohen's κ = 0.96 at T = 2, while an LSTM trained on identical data sits at chance, even though its Fourier power is larger.
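A mod-T probe of this kind can be sketched as follows. The embeddings here are synthetic stand-ins (a clean period-10 circle plus noise), and the least-squares one-hot probe is one simple choice of linear classifier; the paper's probing setup may differ.

```python
import numpy as np

def cohen_kappa(y_true, y_pred, K):
    """Cohen's kappa: agreement beyond chance between labels and predictions."""
    cm = np.zeros((K, K))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    po = np.trace(cm) / n                    # observed agreement
    pe = (cm.sum(0) @ cm.sum(1)) / n ** 2    # chance agreement from marginals
    return (po - pe) / (1 - pe)

def linear_probe(E, y, K):
    """Least-squares linear probe: regress one-hot labels on embeddings."""
    X = np.hstack([E, np.ones((len(E), 1))])  # append a bias column
    Y = np.eye(K)[y]                          # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return X @ W                              # per-class scores

# Synthetic embeddings: numbers 0..999 placed on a period-10 circle + noise.
rng = np.random.default_rng(0)
N, T = 1000, 10
n = np.arange(N)
E = np.column_stack([np.cos(2 * np.pi * n / T), np.sin(2 * np.pi * n / T)])
E = E + 0.05 * rng.normal(size=E.shape)

y = n % T
pred = linear_probe(E, y, T).argmax(axis=1)
print(round(cohen_kappa(y, pred, T), 3))
```

On this circular geometry the probe recovers n mod 10 almost perfectly (κ ≈ 1); on overlapping classes the same probe would sit near κ = 0.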

Fourier spikes vs. probe accuracy: the dissociation between spectrum and geometry
A “spiky” Fourier spectrum does not imply good feature learning. (Left) Transformer, Gated DeltaNet, LSTM, and even the raw token-frequency distribution all show period-T spikes. (Middle) Only the Transformer and Gated DeltaNet achieve high Cohen's κ for mod-T probes; the LSTM and the raw distribution stay near chance. (Right) Theorem 1 explains the gap via the Fisher discriminant of the within-class scatter.

Key takeaways

📡

Spikes are cheap

Even the raw number-token frequency histogram produces the same T = 2, 5, 10 spikes. A spike in a model tells you about the training data, not about understanding.
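How cheap, exactly? A stylized round-number bias alone reproduces the spikes. The magnitudes below are illustrative, not corpus-measured; the only assumption is that even numbers, multiples of 5, and multiples of 10 occur progressively more often in text.

```python
import numpy as np

# Stylized token-frequency curve for numbers 0..999 with round-number bias
# (illustrative weights; no model, no learning).
n = np.arange(1000)
freq = 1.0 + 1.0 * (n % 2 == 0) + 2.0 * (n % 5 == 0) + 4.0 * (n % 10 == 0)

N = len(n)
power = np.abs(np.fft.rfft(freq - freq.mean())) ** 2
for T in (2, 5, 10, 3, 7):
    print(f"T={T:2d}  power={power[N // T]:.1f}")
```

The spectrum spikes at exactly the bins for T = 2, 5, 10 and is (numerically) zero at T = 3 and T = 7, because the bias curve is periodic with period 10.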

🧭

Geometry is earned

Linear mod-T separability is a much stronger property than Fourier sparsity. We prove (Theorem 1) that spikes are necessary but not sufficient.
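A minimal numerical counterexample in the spirit of Theorem 1 (the paper's own construction may differ): two 1-D embeddings, each with exactly one Fourier spike, but only one is mod-2 separable. In the other, all of the spectral power sits in the within-class scatter.

```python
import numpy as np

N = 1000
n = np.arange(N)

# Two 1-D embeddings, each 1-sparse in the Fourier domain:
sep = np.cos(np.pi * n)                      # period 2: evens -> +1, odds -> -1
overlap = np.cos(np.pi * n / 2 + np.pi / 4)  # period 4: both parities hit +-0.707

results = {}
for name, e in [("separable", sep), ("overlapping", overlap)]:
    spikes = int((np.abs(np.fft.rfft(e)) ** 2 > 1).sum())  # spectral sparsity
    evens, odds = e[n % 2 == 0], e[n % 2 == 1]
    between = (evens.mean() - odds.mean()) ** 2            # between-class separation
    within = evens.var() + odds.var()                      # within-class scatter
    results[name] = (between, within)
    print(f"{name}: spikes={spikes}, between={between:.2f}, within={within:.2f}")
```

Both spectra contain a single spike, yet the period-4 embedding assigns each parity class the same pair of values {+0.707, -0.707}: class means coincide, the within-class scatter absorbs all the power, and no linear probe can do better than chance.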

🔬

Structure attribution

Controlled perturbations of the training distribution attribute representations to specific structural properties of the data, giving a complementary lens to influence functions.

🛤️

Two routes to geometry

Geometric convergence arises either from complementary co-occurrence signals in general text, or from multi-token addition training where carries force modular subproblems.

🏗️

Architecture matters

With identical data and compute, Transformers and Gated DeltaNet learn geometrically separable features; LSTMs do not, even though they develop more prominent Fourier spikes.

🔢

Tokenization matters

9-digit (multi-token) addition forces a mod-1000 subproblem at each output position via carry propagation, producing circular representations. 3-digit (single-token) addition imposes no such constraint; outcomes depend on optimizer and seed.
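The carry argument can be checked directly. Assuming a tokenizer that splits numbers into 3-digit chunks (least significant first, a convention chosen here for illustration), the lowest output token is a pure mod-1000 function of the input tokens, and each higher token sees its own input tokens plus a single carry bit.

```python
import random

def to_tokens(x, k=3):
    """Split x into k three-digit tokens, least significant first."""
    return [(x // 1000 ** i) % 1000 for i in range(k)]

random.seed(0)
for _ in range(1000):
    a = random.randrange(10 ** 9)
    b = random.randrange(10 ** 9 - a)  # keep the sum within 9 digits
    out = to_tokens(a + b)
    ta, tb = to_tokens(a), to_tokens(b)
    # Position 0 is a pure mod-1000 subproblem in the input tokens:
    assert out[0] == (ta[0] + tb[0]) % 1000
    # Position 1 depends on lower positions only through one carry bit:
    carry = (ta[0] + tb[0]) // 1000
    assert out[1] == (ta[1] + tb[1] + carry) % 1000
print("ok")
```

A single-token 3-digit sum has no such per-position constraint, which is consistent with the observed dependence on optimizer and seed in that regime.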

A biological analogy

Cladogram of convergent evolution in number representations
A cladogram of the systems studied in this paper, organized by which form of convergence each exhibits. Distinct architectures independently develop similar number representations under shared training pressures, analogous to convergent evolution in biology.

In biology, convergent evolution describes unrelated organisms independently developing similar traits under shared environmental pressures. The eyes of vertebrates and cephalopods are the canonical example. Fourier features in number embeddings fit the same pattern: a shared trait that emerges across radically different systems because they share constraints from training data and tokenization.

Results at a glance

Effect of data perturbations on probe accuracy vs. Fourier spectrum
Geometric convergence depends on the data signal. Perturbations (Swap Numbers, Unigram Replace, Isolate-k, short context) leave Fourier spectra nearly identical, but mod-T probing degrades substantially. Unigram Replace falls to chance.
Architecture ablation on mod-T probing
Architecture matters. Under matched data and compute, Transformers and Gated DeltaNet reach strong mod-T probe accuracy; LSTMs remain near chance despite showing equally prominent Fourier spikes.
Tokenization determines convergence in arithmetic
Tokenization determines convergence in arithmetic. In 9-digit (multi-token) addition, both Muon and AdamW converge to the same spectral structure and near-perfect κ for mod 2, 5, 10. In 3-digit (single-token) addition, the Fourier spectra vary across optimizer and seed, and κ remains near chance.

For the full walkthrough with intuition, theorem statement, and a worked counterexample, see the blog post.

Citation

@misc{fu2026convergent,
  title         = {Convergent Evolution: How Different Language Models Learn Similar Number Representations},
  author        = {Deqing Fu and Tianyi Zhou and Mikhail Belkin and Vatsal Sharan and Robin Jia},
  year          = {2026},
  eprint        = {2604.20817},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.20817}
}