instead of using n-gram, in order to predict the next word (based on probabilities), they convert each sentence into a semantic vector --> show similarity --> understand language patterns
2
Deep Residual Learning for Image Recognition He et al. · arXiv
CV
2015
Done
Why it matters
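Introduced Residual Networks (ResNet), whose skip connections made very deep networks trainable (addresses the degradation problem, often described as easing "vanishing gradients")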
Summary
Instead of learning H(x) directly, learn the residual F(x) = H(x) - x, then add x back (the layers only learn the difference) ---> easier to optimize
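A minimal sketch of the residual idea, assuming vectors are plain Python lists. `residual_block`, `zero_residual`, and `small_residual` are hypothetical names for illustration, not anything from the paper. It shows why the identity mapping becomes the easy default: if the block learns F(x) = 0, the output is just x.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_residual(x):
    # A block that has learned nothing yet: F(x) = 0.
    return [0.0 for _ in x]


def small_residual(x):
    # Hypothetical learned correction: a small perturbation of the input.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)    # equals x
corrected_out = residual_block(x, small_residual)  # x plus a small correction
```

With a plain (non-residual) block, the same identity behavior would require the layers to reproduce x exactly, which is a harder target.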
3
Recursive Language Model Alex L. Zhang et al. · arXiv
LLM
2026
Done
Why it matters
New paradigm - wanted to explore
Summary
For huge prompts, they used Python to store the input in a variable, then wrote code to process it in smaller chunks and recursively called the model on each chunk to solve the query. (A --> call ---> B ---> call and so on)
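A rough sketch of the chunk-and-recurse pattern described above. Assumptions: the chunking is hard-coded here rather than written by the model, `model(query, text)` is a hypothetical stand-in for an LLM call, and `toy_model` is a toy that returns a single word so the recursion is guaranteed to shrink.

```python
def recursive_call(query, text, model, chunk_words=4):
    """Answer `query` over `text` by recursing on word chunks.

    `model(query, text)` is a hypothetical callable standing in for an LLM call.
    Assumes chunk_words >= 2 and that the model returns a short answer.
    """
    words = text.split()
    if len(words) <= chunk_words:
        return model(query, " ".join(words))  # base case: fits in one call
    chunks = [words[i:i + chunk_words] for i in range(0, len(words), chunk_words)]
    partials = [recursive_call(query, " ".join(c), model, chunk_words) for c in chunks]
    # Recurse on the partial answers until they fit in a single call.
    return recursive_call(query, " ".join(partials), model, chunk_words)


def toy_model(query, text):
    # Stand-in "model": returns the longest word it can see.
    return max(text.split(), key=len)


prompt = "a tiny model reads one enormous prompt in several small pieces"
answer = recursive_call("longest word?", prompt, toy_model)
```

The prompt lives in an ordinary variable, and each call only ever sees a small slice of it.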
4
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Dosovitskiy et al. · ICLR
CV
2020
Done
Why it matters
Transformer architecture could beat standard CNNs at image recognition by splitting images into "patches" (tokens) ---> ViT
Summary
Chop images into patches, treat them like words, and throw a standard Transformer at them. 1) At small compute budgets, ResNets and hybrids slightly edge out ViT. 2) As compute increases, ViT overtakes them all.
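A small sketch of the "chop into patches" step, assuming a single-channel image stored as a list of rows. `patchify` is a hypothetical helper name. The real model also applies a learned linear projection and position embeddings, which are left out here.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + r][left + c]
                           for r in range(patch) for c in range(patch)])
    return tokens


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy "image"
tokens = patchify(image, 2)  # 4 tokens, each with 2*2 = 4 values
```

Each token then plays the role a word embedding plays in a text Transformer.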
5
Batch Normalization Ioffe & Szegedy · arXiv
NN
2015
Done
Why it matters
Training stability: normalizing layer inputs makes deep networks train faster and tolerate higher learning rates
Summary
Normalize activations per mini-batch --> compute the mean and variance of each batch ---> \hat{x}_i = (x_i - mean) / sqrt(var + ϵ) --> y_i = \gamma \hat{x}_i + \beta (\gamma and \beta are learned)
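A minimal sketch of the training-time computation for one feature across a mini-batch. The function name and the toy activations are assumptions for illustration. Running statistics for inference are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased variance of the batch
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)                       # mean ~0, variance ~1
shifted = batch_norm(activations, gamma=2.0, beta=1.0)     # mean ~1 after the shift
```

Note that ϵ is added to the variance itself, inside the square root.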
6
Learning Transferable Visual Models From Natural Language Supervision Radford et al. · arXiv
CV
2021
Done
Why it matters
Introduced CLIP, the backbone for modern zero-shot image classification and multimodal understanding.
Summary
Standard supervised vision models are limited by fixed categories and expensive labels. Solution: train on 400 million image-text pairs from the internet with a simple contrastive objective (make matching pairs close, non-matching pairs far) ---> CLIP: a zero-shot model that matches a supervised ResNet-50 on ImageNet and is far more robust
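A sketch of the contrastive objective on toy 2-D embeddings, assuming the loss is a symmetric cross-entropy over cosine similarities. The fixed `temperature=0.1` and all function names are assumptions for illustration (CLIP learns its temperature).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)
        log_sum = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    images = [normalize(v) for v in image_emb]
    texts = [normalize(v) for v in text_emb]
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    transposed = [list(col) for col in zip(*logits)]
    # Symmetric: image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


image_emb = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(image_emb, [[0.9, 0.1], [0.1, 0.9]])
mismatched = contrastive_loss(image_emb, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is low when pair i's image is closest to pair i's text, and high when the captions are swapped.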
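θ = θ − lr · m1 / (sqrt(m2) + ϵ), where m1 = running average of the gradient (first moment) and m2 = running average of the squared gradient (second moment, an uncentered variance)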
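The update above has the shape of the Adam optimizer, so here is a hedged sketch of one Adam-style step on a scalar parameter. The learning rate, the bias-correction terms, and the toy objective f(θ) = θ² are assumptions not stated in the note.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * theta                          # gradient of f(theta) = theta**2
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Dividing by the square root of the second moment makes the step size roughly independent of the gradient's scale.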
8
A Simple Framework for Contrastive Learning of Visual Representations Chen et al. (incl. Geoffrey Hinton) · arXiv
CV
2020
Done
Why it matters
Simplifies self-supervised contrastive learning, showing how heavy data augmentation combined with a contrastive loss can learn powerful visual representations without any labels.
Summary
Differs from standard supervised learning on ImageNet in three ways: strong data augmentation + a non-linear projection head at the end of the network + a contrastive loss function
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness Robert Geirhos et al. · ICLR
CV
2019
Done
Why it matters
Shows standard CNNs (like ResNet) learn to classify images based on local textures rather than global shapes.
Summary
ImageNet-trained CNNs are texture machines, unlike humans --> train on Stylized-ImageNet, where textures are uninformative (forces a bias towards shapes) ---> more robust, better at detection, and closer to human behavior
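1) Before: English sentence → Encoder → ONE fixed-length vector → Decoder → French sentence (the single vector is the bottleneck) 2) The alignment model scores each encoder state against the current decoder state: score = v · tanh(W · decoder_state + U · encoder_state), and a softmax over the scores gives the attention weights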
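A sketch of the alignment score and the attention weights it produces, using plain lists. The parameter values for `W`, `U`, and `v` are untrained toy numbers, and the helper names are assumptions for illustration.

```python
import math


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def alignment_score(v, W, U, decoder_state, encoder_state):
    """score = v . tanh(W @ decoder_state + U @ encoder_state)"""
    hidden = [math.tanh(a + b) for a, b in
              zip(matvec(W, decoder_state), matvec(U, encoder_state))]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def attend(v, W, U, decoder_state, encoder_states):
    scores = [alignment_score(v, W, U, decoder_state, h) for h in encoder_states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over source positions
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(len(encoder_states[0]))]
    return weights, context


# Toy parameters (hypothetical values, not trained).
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
weights, context = attend(v, W, U, [0.5, 0.5], encoder_states)
```

The decoder gets a fresh context vector at every output step instead of one vector for the whole sentence.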
RandAugment is a highly practical, state-of-the-art technique for automating data augmentation strategies without needing to manually tune endless augmentation parameters.
Summary
AutoAugment (RL) and friends required expensive separate search on proxy tasks ---> Solution : Reduce the entire augmentation policy to 2 parameters – N (# of transforms) and M (global magnitude). Randomly select transforms, apply with same magnitude.
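A toy sketch of the two-parameter policy. The three transforms and the 0 to 1 magnitude scale are assumptions for illustration (they are not the paper's operation set or magnitude range), and the "image" is just a list of pixel intensities.

```python
import random


# Hypothetical toy transforms on pixel intensities in [0, 1];
# each takes the image and the shared magnitude m (scaled to 0..1 here).
def brighten(img, m):
    return [min(1.0, p + 0.5 * m) for p in img]


def darken(img, m):
    return [max(0.0, p - 0.5 * m) for p in img]


def invert_blend(img, m):
    return [(1 - m) * p + m * (1 - p) for p in img]


TRANSFORMS = [brighten, darken, invert_blend]


def rand_augment(img, n, m, rng=random):
    """Apply n uniformly sampled transforms, all at the same magnitude m."""
    for op in rng.choices(TRANSFORMS, k=n):
        img = op(img, m)
    return img


image = [0.2, 0.5, 0.8]
augmented = rand_augment(image, n=2, m=0.3, rng=random.Random(0))
```

Tuning reduces to a small grid over `n` and `m` instead of a separate policy search.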
13
ImageNet Large Scale Visual Recognition Challenge Olga Russakovsky et al. · arXiv
CV
2015
Done
Why it matters
It fundamentally shifted the paradigm of computer vision from focusing on algorithms to focusing on data scale and quality
Summary
Data + Compute + Deep Networks defeated handcrafted computer vision features.
min_G max_D V(D, G), where V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]
D tries to maximize its ability to detect fake samples. G tries to minimize D's ability to detect them.
If image is real → D should say 1
If image is fake → D should say 0
G wants D to be wrong
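A small sketch that estimates V(D, G) from discriminator outputs, to make the two terms concrete. The function name and the sample probabilities are made-up numbers for illustration.

```python
import math


def value_fn(d_real, d_fake):
    """Monte Carlo estimate of V(D, G).

    d_real: D(x) for real samples; d_fake: D(G(z)) for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# D separates real from fake well: V is close to its maximum of 0.
confident_d = value_fn(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# D cannot tell them apart and outputs 0.5 everywhere: V = -log(4).
fooled_d = value_fn(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

D pushes the value up towards 0, while G pulls it down by making D(G(z)) large.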
Pioneered the shift from manually designed data augmentation to automated, data-driven approaches in computer vision (RL).
Summary
AutoAugment frames augmentation design as a search problem.
Search space (a policy of sub-policies): a policy consists of 5 sub-policies. For each image in a mini-batch, one sub-policy is chosen at random and applied. Each sub-policy has 2 operations (applied sequentially).
Search: they use an RNN controller that predicts augmentation policies. The controller has 30 softmax predictions.
1. Train a "child model" (small neural network) on the target dataset using this augmentation policy
2. Evaluate the child model on a validation set to get accuracy
3. Use that accuracy as a reward signal to update the controller via Proximal Policy Optimization (PPO)
Insane part: these policies transfer to other datasets and architectures!
The sun rises in the [MASK]. ↓ 1. Masked LM: predict randomly masked words using the FULL context (left and right) 2. Next Sentence Prediction: predict whether two sentences are consecutive
If we keep scaling models and data, performance keeps improving smoothly. With a fixed compute budget, you should not make only the model extremely large or only the dataset extremely large. Instead, balance them.
A generative model: pθ(z) · pθ(x | z) (prior + decoder) ↓ Problem: the true posterior pθ(z | x) and the marginal likelihood pθ(x) are intractable ↓ Approximate the posterior with an encoder qϕ(z | x) ↓ Use the reparameterization trick to get low-variance gradients ↓ Optimize the variational lower bound (ELBO) with SGD
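A sketch of the two pieces that make the ELBO trainable, assuming a diagonal Gaussian encoder and a standard normal prior: the reparameterized sample z = μ + σ·ε and the closed-form KL term. The function names and toy values are illustrative, and the reconstruction term E[log pθ(x | z)] is left out.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1); the noise does not depend on mu or sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian q."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [1.0, -2.0], [0.0, math.log(0.25)]
samples = [reparameterize(mu, log_var, rng) for _ in range(5000)]
sample_mean = [sum(s[d] for s in samples) / len(samples) for d in range(2)]
kl = kl_to_standard_normal(mu, log_var)
```

Because the randomness is isolated in ε, gradients can flow through μ and σ back to the encoder.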
Forward (fixed): real image x0 → x1 → x2 → ... → xT, eventually → pure random noise ↓ Reverse (learned): xT → xT-1 → xT-2 → ... → x0 using pθ(x_{t-1} | x_t) ↓ Given a noisy sample, predict the less noisy version.
Connected diffusion models with denoising score matching
Summary
We define a Markov chain. Forward process: q(x_t | x_{t-1}) = N(x_t; sqrt(1 − β_t) x_{t-1}, β_t I), i.e. add a little Gaussian noise at each step until x_T ∼ N(0, I) ↓ Reverse process (learned): we want pθ(x_{t-1} | x_t), parameterized by a noise predictor ϵθ(x_t, t) ↓ Loss = E[ ||ϵ − ϵθ(x_t, t)||^2 ] ↓ Given a noisy sample, predict the less noisy version.
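A sketch of the forward noising chain and the noise-prediction loss on a toy signal. Assumptions: a constant β for every step (real schedules vary β_t), a flat "image" of 2000 identical pixels, and hypothetical function names. No network is trained here.

```python
import math
import random


def forward_step(x_prev, beta, rng):
    """Sample x_t ~ N(sqrt(1 - beta) * x_{t-1}, beta * I)."""
    return [math.sqrt(1 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for x in x_prev]


def noise_prediction_loss(eps, eps_pred):
    """Squared error between the true noise and the model's prediction."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


rng = random.Random(0)
x = [3.0] * 2000            # toy "image": 2000 pixels, all equal to 3
for _ in range(200):        # T = 200 steps with a constant beta (an assumption)
    x = forward_step(x, beta=0.05, rng=rng)

mean = sum(x) / len(x)                          # close to 0
var = sum((v - mean) ** 2 for v in x) / len(x)  # close to 1
```

The sqrt(1 − β) factor shrinks the signal exactly enough that the variance settles at 1, which is why x_T ends up looking like N(0, I).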
In a Transformer layer, instead of FeedForward(x), we can use: ↓ Expert_i(x) (only one selected) ↓ chosen by: router(x) → probabilities over experts ↓ Then: pick argmax → choose one expert
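A sketch of top-1 routing with two toy experts. Assumptions: the router is a single linear map, the experts are simple lambdas standing in for feed-forward networks, and the output is scaled by the gate probability (borrowed from the Switch Transformer formulation, not stated in the note).

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_layer(x, router_weights, experts):
    """Route x to the single highest-probability expert."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    chosen = max(range(len(probs)), key=probs.__getitem__)  # argmax
    # Scaling by the gate probability lets gradients reach the router
    # (an assumption borrowed from the Switch Transformer formulation).
    output = [probs[chosen] * y for y in experts[chosen](x)]
    return chosen, output


# Two hypothetical experts standing in for feed-forward networks.
experts = [lambda x: [2 * xi for xi in x], lambda x: [-xi for xi in x]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row of router weights per expert
chosen, output = switch_layer([0.2, 3.0], router_weights, experts)
```

Only the chosen expert runs, so parameter count grows with the number of experts while per-token compute stays roughly flat.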
How diffusion became practical — behind the images you generate
Summary
1. Train an autoencoder to compress images into a small, meaningful latent space (do this once). ↓ 2. Train a diffusion model (U-Net with cross-attention) to denoise in that latent space, conditioned on text or other signals. ↓ 3. At inference: sample noise in latent space, run the U-Net iteratively to denoise it, then decode back to pixels.
Standard attention: compute → write → read → compute → write → read ↓ FlashAttention: load once → compute everything → write once ↓ To do this, they calculate softmax block by block. Normal softmax: softmax(x_i) = e^{x_i} / Σ_j e^{x_j} ↓ Decomposable softmax: m(x) = max(x), ℓ(x) = Σ_i e^{x_i − m(x)} ↓ Then for two blocks x^(1), x^(2): m = max(m1, m2) and ℓ = e^{m1 − m} ℓ1 + e^{m2 − m} ℓ2 ↓ Why the merge works: e^{x_i − m} = e^{m1 − m} · e^{x_i − m1} for every x_i in block 1, so block 1 contributes e^{m1 − m} ℓ1 (and likewise for block 2)
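A quick numerical check of the block-merge rule for (m, ℓ), using plain lists. The function names and the example scores are assumptions for illustration. This covers only the softmax statistics, not the full tiled attention kernel.

```python
import math


def block_stats(block):
    """Return (m, l): the block max and the sum of exp(x - m)."""
    m = max(block)
    return m, sum(math.exp(x - m) for x in block)


def merge(stats_a, stats_b):
    """Combine two blocks' (m, l) without revisiting their elements."""
    (m1, l1), (m2, l2) = stats_a, stats_b
    m = max(m1, m2)
    return m, math.exp(m1 - m) * l1 + math.exp(m2 - m) * l2


scores = [1.0, 3.0, -2.0, 5.0, 0.5, 4.0]
merged = merge(block_stats(scores[:3]), block_stats(scores[3:]))
direct = block_stats(scores)  # same statistics computed over the whole row at once
probs = [math.exp(s - merged[0]) / merged[1] for s in scores]
```

Since the merge only needs two numbers per block, the full row of scores never has to be stored in one place.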
ϕ_i(f, x) = Σ_{z′ ⊆ x′} [ |z′|! (M − |z′| − 1)! / M! ] · [ f_x(z′) − f_x(z′ ∖ i) ]
For each subset z′ of features:
1) Calculate the weight |z′|! (M − |z′| − 1)! / M!
2) Calculate the marginal contribution [f_x(z′) − f_x(z′ ∖ i)]
3) Sum it all up
Gives the fair contribution of feature i to the prediction
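A brute-force sketch of the three steps above. It uses the classical Shapley indexing, where the sum runs over subsets S that exclude feature i and the marginal contribution is v(S ∪ {i}) − v(S), with the same weight |S|!(M − |S| − 1)!/M!. `toy_value` is a hypothetical payoff function, and the exponential cost makes this practical only for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(num_features, value_fn):
    """Exact Shapley values; value_fn maps a frozenset of feature indices to a payoff."""
    M = num_features
    phi = [0.0] * M
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical model: features 0 and 1 add fixed amounts, and 1 and 2 interact.
def toy_value(present):
    total = 0.0
    if 0 in present:
        total += 2.0
    if 1 in present:
        total += 1.0
    if 1 in present and 2 in present:
        total += 4.0
    return total


phi = shapley_values(3, toy_value)
```

In this toy game the interaction payoff of 4 is split evenly between features 1 and 2, and the values sum to the full prediction minus the empty-set baseline.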