Attention mechanism:
Given an input sequence x₁, ..., xₙ, an attention layer computes f(x₁, ..., xₙ) = Attention(Q, K, V), where Q, K, and V are learned linear projections of the inputs:
1. Query (Q): What we're looking for
2. Key (K): What we match against
3. Value (V): What we retrieve
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where d_k is the key dimension; dividing by √d_k keeps the dot products from growing with dimension and saturating the softmax.
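To make the computation concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name and the max-shift inside the softmax are illustrative choices, not from the text above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(QK^T / sqrt(d_k)) V for one sequence.

    Q: (n, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
    """
    d_k = K.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    # so the softmax inputs do not grow with the key dimension.
    scores = Q @ K.T / np.sqrt(d_k)                # (n, n)
    # Row-wise softmax with a max-shift for numerical stability;
    # each query's weights over all keys sum to 1.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V                             # (n, d_v)
```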
This is how transformers "pay attention" to the relevant parts of the input: every position can attend directly to every other position in a single step, which is what lets them capture long-range dependencies and context.
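As a quick usage sketch (assuming the function above, with random stand-in matrices for the learned projections): deriving Q, K, and V from the same sequence gives self-attention, and one attention step lets the last position draw on the first, however far apart they are.

```python
# Toy self-attention call; Wq, Wk, Wv stand in for the learned
# projection matrices that produce queries, keys, and values.
rng = np.random.default_rng(0)
n, d = 6, 4                                  # sequence length, width
X = rng.normal(size=(n, d))                  # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                             # (6, 4): one vector per position
```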