# BERT (Bidirectional Encoder Representations from Transformers)

Confidence: high
Last verified: 2026-05-22
Generation: ai_assisted

## TL;DR

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model introduced by Google in 2018 that reshaped NLP by reading text bidirectionally: each token's representation is conditioned on both its left and right context at once. At launch it set new state-of-the-art results on 11 NLP tasks, including GLUE (an official leaderboard score of 80.5%, a 7.7-point absolute improvement), SQuAD v1.1, and SWAG.

## Core Explanation

Earlier models read text in one direction at a time. GPT processed text left-to-right, and ELMo concatenated separately trained left-to-right and right-to-left passes. BERT instead uses a masked language modeling (MLM) objective that lets every layer condition on both directions jointly. During pre-training, 15% of input tokens are randomly selected for prediction, and the model learns to predict them using context from both sides.

BERT's architecture is a multi-layer bidirectional Transformer encoder. The base model has 12 layers (Transformer blocks), a hidden size of 768, and 12 attention heads, totaling 110M parameters. The large model has 24 layers, a hidden size of 1024, and 16 attention heads, totaling 340M parameters. Pre-training used English Wikipedia (2.5B words) and BookCorpus (0.8B words). BERT Base was trained on 4 Cloud TPUs in Pod configuration (16 TPU chips) and BERT Large on 16 Cloud TPUs (64 TPU chips); each run took 4 days.

## Detailed Analysis

### Training Objectives

BERT uses two unsupervised pre-training tasks:

1. **Masked Language Modeling (MLM)**: Randomly select 15% of tokens for prediction. Of those selected positions:
   - 80% are replaced with `[MASK]`
   - 10% are replaced with a random token (adds noise for robustness)
   - 10% are left unchanged (biases the representation toward the actually observed token and reduces the mismatch with fine-tuning, where `[MASK]` never appears)

   The model predicts the original token at each selected position.

2. **Next Sentence Prediction (NSP)**: Given two sentences A and B, predict whether B follows A. 50% of training pairs are consecutive, and 50% pair A with a random sentence from the corpus. Later work found NSP to be non-essential (RoBERTa removed it), but it was part of the original design.
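
The MLM corruption rule described under Training Objectives can be sketched in a few lines of plain Python. This is a minimal illustration, not the original implementation: the function name `mask_tokens`, the toy vocabulary, and the per-token coin flip used to select roughly 15% of positions are all assumptions made for the example, as is the choice to never select `[CLS]` or `[SEP]`.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}  # assumption: special tokens are never selected


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Corrupt a token sequence with the 80/10/10 rule (hypothetical helper).

    Returns (corrupted, labels). labels[i] is the original token at positions
    the model must predict, and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)

    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            continue
        # Assumption: independent coin flip per token, so about 15% are selected.
        if rng.random() >= select_prob:
            continue

        labels[i] = tok  # the prediction target is always the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged

    return corrupted, labels


if __name__ == "__main__":
    toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
    sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
    corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
    print(corrupted)
    print(labels)
```

The training loss would be computed only at positions where `labels` is not `None`. Because 10% of selected tokens stay unchanged and 10% become random tokens, the model cannot tell from the input alone which positions it will be scored on.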

### Input Representation

BERT's input is the sum of three embeddings:

- **Token embeddings**: WordPiece tokenization with a 30,000-token vocabulary
- **Segment embeddings**: Learned embedding indicating sentence A vs. B
- **Position embeddings**: Learned positional embeddings (not fixed sinusoids)

Special tokens: `[CLS]` (classification token at the start of every sequence) and `[SEP]` (separator placed after each sentence).

### Fine-Tuning Paradigm

BERT popularized the "pre-train then fine-tune" paradigm:

1. Pre-train on a large unlabeled corpus (Wikipedia + BookCorpus)
2. Fine-tune on the downstream task with labeled data (minutes to hours)
3. Make minimal task-specific architecture changes (typically just one added output layer)

### Key Benchmarks (at launch, 2018)

| Task | Previous SOTA | BERT Base | BERT Large |
|------|:-----------:|:---------:|:----------:|
| GLUE (average) | 75.1 | 79.6 | 82.1 |
| SQuAD v1.1 (F1) | n/a | 88.5 | 93.2 |
| SQuAD v2.0 (F1) | n/a | 76.3 | 83.1 |
| SWAG (accuracy) | n/a | n/a | 86.3 |
| MultiNLI (matched accuracy) | 82.1 | 84.6 | 86.7 |

The GLUE row uses the average reported in the paper's results table. On the official GLUE leaderboard, BERT Large scored 80.5 against 72.8 for the previous best system (OpenAI GPT). The SQuAD v1.1 figure for BERT Large is the ensemble result trained with additional TriviaQA data.

### Legacy

BERT's impact on NLP was profound. It established the Transformer encoder as the dominant architecture for language understanding, inspired a family of variants (RoBERTa, ALBERT, DistilBERT, DeBERTa), and demonstrated that large-scale pre-training with bidirectional context was a key ingredient of transfer learning in NLP. Google Scholar (2026) reports over 100,000 citations.

## Further Reading

- [BERT Paper](https://arxiv.org/abs/1810.04805): Original paper by Devlin et al.
- [The Illustrated BERT](https://jalammar.github.io/illustrated-bert/): Visual walkthrough
- [HuggingFace BERT](https://huggingface.co/docs/transformers/model_doc/bert): Implementation and usage