Question 1
What does the term 'multi-headed attention' refer to?
Question 2
Which layer in a transformer model updates embeddings based on surrounding context?
Question 3
What is the purpose of the softmax function in the attention mechanism?
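For reference when answering, a minimal sketch in plain Python (hypothetical score values) of what softmax does inside attention: it turns raw query-key dot products into weights that are all positive and sum to 1, so they can be used to mix value vectors.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw query-key dot products for one query against three keys:
scores = [2.0, 1.0, 0.1]
weights = softmax(scores)
# All weights are positive and sum to 1, forming a distribution
# over which tokens this query attends to.
```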
Question 4
What is the significance of embeddings in transformer models?
Question 5
What role does the attention mechanism play in understanding context?
Question 6
What kind of influence does the attention mechanism have on embeddings?
Question 7
What is the primary objective of a transformer model?
Question 8
Why is the parallelizability of the attention mechanism important?
Question 9
What differentiates self-attention from cross-attention?
Question 10
What method is used to prevent later tokens from influencing earlier tokens in self-attention?
Question 11
What was the main highlight of the 2017 paper 'Attention is All You Need'?
Question 12
What is the approximate parameter count for one attention head in a transformer?
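A worked arithmetic sketch for this question, assuming GPT-3-scale dimensions (embedding size 12,288 and key/query space 128; these specific numbers are assumptions, not given in the questions): one head holds four matrices of shape 12,288 x 128 (query, key, and the value map factored into a value-down and a value-up projection), which comes to roughly 6.3 million parameters.

```python
d_embed = 12_288   # assumed embedding dimension (GPT-3 scale)
d_head = 128       # assumed key/query (head) dimension

# Four d_embed x d_head matrices per head: the query and key matrices,
# plus the value map factored into value-down and value-up projections.
params_per_head = 4 * d_embed * d_head
print(params_per_head)  # prints 6291456, about 6.3 million
```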
Question 13
What happens during the masking process in the attention mechanism?
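A minimal sketch of the masking step (hypothetical score values): before softmax, every grid entry where an earlier position would attend to a later one is set to negative infinity, so that after softmax those weights become exactly zero and later tokens cannot influence earlier ones.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3x3 grid of raw attention scores (query index i, key index j).
scores = [[0.5, 1.2, -0.3],
          [0.1, 0.9,  2.0],
          [1.5, 0.2,  0.7]]

n = len(scores)
# Causal mask: a position may not attend to positions after it,
# so overwrite every entry above the diagonal (j > i) with -infinity.
for i in range(n):
    for j in range(i + 1, n):
        scores[i][j] = float("-inf")

weights = [softmax(row) for row in scores]
# After softmax the masked entries are exactly 0, so each row is still
# a valid distribution over only the current and earlier positions.
```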
Question 14
In transformers, which matrix is used to produce the value vectors?
Question 15
How is the query vector (q) formed during an attention head computation?
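A minimal sketch for this last question, assuming a query matrix conventionally called W_Q and toy dimensions (embedding size 4, query/key size 2; all values hypothetical): the query vector q is the matrix-vector product of W_Q with the token's embedding.

```python
def matvec(W, x):
    # Multiply matrix W (rows x cols) by vector x.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# Hypothetical learned query matrix mapping a 4-dim embedding
# into a 2-dim query/key space.
W_Q = [[0.1, 0.0, -0.2, 0.3],
       [0.0, 0.5,  0.1, 0.0]]
embedding = [1.0, 2.0, 0.0, -1.0]  # one token's embedding

q = matvec(W_Q, embedding)  # the token's query vector, length 2
```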