Reading List

Showing 66 of 66 papers

#	Title & Authors · Venue	Category	Year	Status
1	Neural Probabilistic Language Model Bengio et al. · Journal of Machine Learning Research	NLP	2003	Done
	Why it matters Foundation of neural language modeling Summary Instead of n-gram counts, the model maps each word to a learned feature vector (embedding) and predicts the next word's probability from the vectors of the preceding words --> similar words get similar vectors --> the model generalizes to word sequences it never saw
2	Deep Residual Learning for Image Recognition He et al. · archive	CV	2015	Done
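	Why it matters Introduced Residual Networks (ResNet) & made very deep networks trainable by addressing the degradation problem (often loosely described as fixing "vanishing gradients") Summary Instead of learning H(x) directly, each block learns the residual F(x) = H(x) - x and adds x back through a skip connection: output = F(x) + x ---> easier to optimize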
3	Recursive Language Model Alex L. Zhang et al. · archive	LLM	2026	Done
	Why it matters New paradigm - wanted to explore Summary For huge prompts, they used Python to store the input in a variable, then wrote code to process it in smaller chunks and recursively called the model on each chunk to solve the query. (A --> call ---> B ---> call and so on)
4	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Dosovitskiy et al. · ICLR	CV	2020	Done
	Why it matters Showed that a Transformer can beat standard CNNs at image recognition by splitting images into "patches" (tokens) ---> ViT Summary Chop images into patches, treat them like words, and throw a standard Transformer at it. 1) At small compute budgets, ResNets and hybrids slightly edge out ViT. 2) As compute increases, ViT overtakes them all.
5	Batch Normalization Ioffe & Szegedy · archive	NN	2015	Done
	Why it matters Training stability Summary Normalize each layer's activations over the mini-batch --> compute the batch mean μ and variance σ² ---> \hat{x}_i = (x_i - μ) / sqrt(σ² + ϵ) --> y_i = \gamma \hat{x}_i + \beta (γ and β are learned)
6	Learning Transferable Visual Models From Natural Language Supervision Radford et al. · archive	CV	2021	Done
	Why it matters Introduced CLIP, the backbone for modern zero-shot image classification and multimodal understanding. Summary Supervised vision models are limited by fixed categories and expensive labels. Solution : Train on 400 million image-text pairs from the internet with a simple contrastive objective – make matching pairs close, non-matching pairs far ---> CLIP : A zero-shot model that matches supervised ResNet-50 on ImageNet & is far more robust to distribution shift
7	Adam Optimizer Kingma & Ba · ICLR	NN	2014	Done
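	Why it matters Adaptive learning rates Summary θ ← θ − α · m̂ / (sqrt(v̂) + ϵ), where m = running average of the gradients (1st moment, mean) & v = running average of the squared gradients (2nd moment, uncentered variance); m̂ and v̂ are the bias-corrected versions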
8	A Simple Framework for Contrastive Learning of Visual Representations Chen et al. · archive	CV	2020	Done
	Why it matters Simplifies self-supervised contrastive learning, showing how heavy data augmentation combined with a contrastive loss can learn powerful visual representations without any labels. Summary Different from other models (supervised learning on ImageNet) : data augmentation + non-linear head at the end of network + loss function
9	Word2Vec Mikolov et al. · archive	NLP	2013	Done
	Why it matters Learned embeddings — foundation of all modern NLP Summary vector("King")−vector("Man")+vector("Woman")≈vector("Queen")
10	ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness Robert Geirhos et al. · ICLR	CV	2019	Done
	Why it matters Shows standard CNNs (like ResNet) learn to classify images based on local textures rather than global shapes. Summary ImageNet-trained CNNs are texture machines != humans --> Train on Stylized-ImageNet, where textures are uninformative (forces a bias towards shapes) ---> more robust, better at detection, and closer to human behavior
11	Seq2Seq with Attention Bahdanau et al. · ICLR	LLM	2014	Done
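	Why it matters Where the attention mechanism was born Summary 1) Plain seq2seq : English sentence → Encoder → ONE fixed-length vector → Decoder → French sentence (the single vector is a bottleneck for long sentences) 2) With attention, the decoder looks at all encoder states. The alignment model : score = v^T tanh(W * decoder_state + U * encoder_state) --> softmax over scores --> context = weighted sum of encoder states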
12	RandAugment: Practical automated data augmentation with a reduced search space Ekin D. Cubuk et al. · archive	CV	2019	Done
	Why it matters RandAugment is a highly practical, state-of-the-art technique for automating data augmentation strategies without needing to manually tune endless augmentation parameters. Summary AutoAugment (RL) and friends required expensive separate search on proxy tasks ---> Solution : Reduce the entire augmentation policy to 2 parameters – N (# of transforms) and M (global magnitude). Randomly select transforms, apply with same magnitude.
13	ImageNet Large Scale Visual Recognition Challenge Olga Russakovsky et al. · archive	CV	2015	Done
	Why it matters it fundamentally shifted the paradigm of computer vision from focusing on algorithms to focusing on data scale and quality Summary Data + Compute + Deep Networks defeated handcrafted computer vision features.
14	Generative Adversarial Networks (GAN) Goodfellow et al. · archive	GAN	2014	Done
	Why it matters Generative modeling before diffusion took over Summary min_G max_D V(D, G), where V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))] ↓ D tries to maximize its ability to detect fake samples. G tries to minimize D’s ability to detect them. ↓ If image is real → D should say 1. If image is fake → D should say 0. G wants D to be wrong.
15	AutoAugment: Learning Augmentation Policies from Data Ekin D. Cubuk et al. · archive	CV	2019	Done
	Why it matters Pioneered the shift from manually designed data augmentation to automated, data-driven approaches in computer vision (RL). Summary AutoAugment frames augmentation design as a search problem ---> The Search Space: A Policy of Sub-Policies : A policy consists of 5 sub-policies. For each image in a mini-batch, one sub-policy is chosen at random and applied. Each sub-policy has 2 operations (applied sequentially) ---> They use a RNN controller that predicts augmentation policies. The controller has 30 softmax predictions. 1. Train a "child model" (small neural network) on the target dataset using this augmentation policy 2. Evaluate the child model on a validation set to get accuracy 3. Use that accuracy as a reward signal to update the controller via Proximal Policy Optimization (PPO) Insane part : these policies transfer to other datasets and architectures!
16	Attention Is All You Need Vaswani et al. · archive	LLM	2017	Done
	Why it matters THE paper. Start of modern AI. Summary Attention(Q,K,V) = softmax(QK^T / √d_k) V PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
17	A Simple Framework for Contrastive Learning of Visual Representations Chen et al. · archive	CV	2020	Done
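	Why it matters Bridged the performance gap between self-supervised learning (SSL) and supervised learning in computer vision + NT-Xent loss (an InfoNCE-style contrastive loss) Summary 1. Original image ↓ two augmented views ↓ 2. Each view → encoder (ResNet) → representation ↓ 3. Representation → projection head (MLP) → vector used only for the contrastive loss ↓ 4. Contrastive loss pulls the two views of the same image together and pushes views of other images away; after training, the projection head is dropped and the encoder output is the representation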
18	BERT Devlin et al. · archive	LLM	2018	Done
	Why it matters Bidirectional pretraining revolution Summary The sun rises in the [MASK]. ↓ 1.Masked LM: predict randomly masked words using FULL context 2.Next Sentence Prediction: predict if two sentences are consecutive
19	GPT-2 Radford et al. · ---	LLM	2019	Done
	Why it matters Scale works + "too dangerous to release" Summary huge data + huge model → learn many tasks automatically
20	FaceNet: A Unified Embedding for Face Recognition and Clustering Schroff et al. · archive	CV	2015	Done
	Why it matters Triplet Loss Summary A triplet contains: 1. Anchor face 2. Positive face (same person) 3. Negative face (different person) Goal : distance(anchor, positive) + margin < distance(anchor, negative)
21	Language Models are Few-Shot Learners Brown et al. · archive	LLM	2020	Done
	Why it matters Few-shot learning and emergent abilities + GPT-3 Summary Big language model + examples in prompt = new task solved
22	Dimensionality Reduction by Learning an Invariant Mapping Hadsell et al. · CVPR	ML	2006	Done
	Why it matters The contrastive loss + Siamese Networks Summary Instead of classifying data directly, learn a space where similarity becomes distance.
23	Scaling Laws for Neural Language Models Kaplan et al. · archive	LLM	2020	Done
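	Why it matters The empirical case for bigger = better Summary If we keep scaling model size, data, and compute, the loss keeps improving smoothly (power laws). With a fixed compute budget, Kaplan et al. found it most efficient to train a very large model on a relatively modest amount of data and stop well before convergence. Chinchilla (entry 25) later revised this toward balancing model size and training tokens.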
24	Auto-Encoding Variational Bayes Diederik P Kingma, Max Welling · archive	GEN-AI	2013	Done
	Why it matters Variational Autoencoders (VAEs) Summary A generative model: p_θ(z) · p_θ(x | z) (prior + decoder) ↓ Problem: the true posterior p_θ(z | x) and the marginal likelihood p_θ(x) are intractable ↓ Approximate the posterior with an encoder q_ϕ(z | x) ↓ Use the reparameterization trick to get low-variance gradients ↓ Optimize the variational lower bound (ELBO) with SGD
25	Chinchilla (Training Compute-Optimal LLMs) Hoffmann et al. · archive	LLM	2022	Done
	Why it matters Balance model size with data — changed how every lab trains Summary Training tokens ≈ 20 × number of parameters
26	Deep Unsupervised Learning using Nonequilibrium Thermodynamics Sohl-Dickstein et al. · archive	CV	2015	Done
	Why it matters Intro to diffusion models Summary Forward (fixed): a real image is noised step by step until it is pure random noise : x0 → x1 → x2 → ... → xT ↓ Reverse (learned): xT → xT-1 → xT-2 → ... → x0 using pθ(xt−1∣xt) ↓ Given a noisy sample, predict the less noisy version.
27	Denoising Diffusion Probabilistic Models Jonathan Ho et al. · NeurIPS	CV	2020	Done
	Why it matters Connected diffusion models with denoising score matching Summary We define a Markov chain: Forward process (fixed) ↓ q(x_t ∣ x_{t−1}) = N(x_t; sqrt(1−β_t) x_{t−1}, β_t I) ↓ Each step adds a little Gaussian noise until x_T ∼ N(0, I) ↓ Reverse process (learned) : we want p_θ(x_{t−1} ∣ x_t), parameterized by a noise predictor ϵ_θ(x_t, t) ↓ Loss = E[ ‖ϵ − ϵ_θ(x_t, t)‖² ] ↓ Given a noisy sample, predict the noise that was added, then step to the less noisy version.
28	Switch Transformer (Mixture of Experts) Fedus et al. · archive	LLM	2021	Done
	Why it matters Scale without proportional compute cost Summary In transformer layer, instead of FeedForward(x), we can use: ↓ Expert_i(x) (only one selected) ↓ chosen by : router(x) → probabilities over experts ↓ Then : pick argmax → choose one expert
29	Latent Diffusion (Stable Diffusion) Rombach et al. · archive	CV	2022	Done
	Why it matters How diffusion became practical — behind the images you generate Summary 1.Train an autoencoder to compress images into a small, meaningful latent space — do this once. ↓ 2.Train a diffusion model (U-Net with cross-attention) to denoise in that latent space, conditioned on text or other signals. ↓ 3. At inference: sample noise in latent space, run the U-Net iteratively to denoise it, then decode back to pixels.
30	U-Net: Convolutional Networks for Biomedical Image Segmentation Ronneberger et al. · archive	CV	2015	Done
	Why it matters Introduced a highly efficient encoder-decoder architecture with skip connections, enabling accurate image segmentation, particularly in medical imaging Summary Contracting path (encoder) ↓ → bottleneck → Expanding path (decoder) ↑, with skip connections that copy encoder feature maps to the decoder level of the same resolution
31	Training language models to follow instructions with human feedback Ouyang et al. · archive	LLM	2022	Done
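	Why it matters Why ChatGPT behaves like ChatGPT Summary Maximize E[r_ϕ(x, y)] with PPO, where r_ϕ is a reward model trained on human preference rankings ↓ Objective : reward − β · KL(policy ∣∣ SFT model) ↓ The KL penalty keeps the policy from drifting too far from the supervised fine-tuned model.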
32	Chain-of-Thought Prompting Wei et al. · archive	LLM	2022	Done
	Why it matters Foundation of reasoning models (o1 / DeepSeek-R1) Summary Transforming inference into a structured latent-variable generation process, with the reasoning chain z as the latent variable ↓ p(y ∣ x) = ∑_z p(y ∣ z, x) p(z ∣ x)
33	LoRA Hu et al. · archive	LLM	2021	Done
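	Why it matters Fine-tune billion-parameter models on one GPU Summary Φ ← Φ0 + ΔΦ (ΔW = the weight update learned by fine-tuning) ↓ ΔW = BA, with B ∈ R^{d×r}, A ∈ R^{r×k} ↓ W = W0 + BA --> h = W0x + BAx ---> h = W0x + Bz, where z = Ax ↓ LoRA learns: “Which r-dimensional subspace of the input-output mapping should be modified?” ↓ Initialization : A ∼ N(0, σ²) and B = 0 ---> ΔW = 0 at the start ↓ ΔWx is scaled by α/r : ΔWx ← (α/r) BAx --> rank(ΔW) ≤ r ≪ min(d, k)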
34	FlashAttention Dao et al. · archive	LLM	2022	Done
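	Why it matters Why you can have 100K+ context windows today Summary Standard attention: Compute → write → read → compute → write → read (the full N×N score matrix goes through slow GPU memory) ↓ FlashAttention: Load once → compute everything → write once (block by block in fast on-chip memory) ↓ To do this, they compute softmax block by block : normal softmax : softmax(x_i) = e^{x_i} / ∑_j e^{x_j} ↓ Decomposable softmax : m(x) = max(x), ℓ(x) = ∑_i e^{x_i − m(x)} ↓ Then for two blocks x(1), x(2) : m = max(m1, m2) & ℓ = e^{m1−m} ℓ1 + e^{m2−m} ℓ2 ↓ Why the merge works : e^{x_i − m} = e^{m_k − m} · e^{x_i − m_k} for every x_i in block k, so each block's sum only needs rescaling by e^{m_k − m}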
35	DPO (Direct Preference Optimization) Rafailov et al. · archive	RL	2023	To Read
	Why it matters Simpler alternative to RLHF — most open-source models use this now
36	LLaMA Touvron et al.		2023	To Read
	Why it matters Kicked off the open-source LLM movement
37	RAG (Retrieval-Augmented Generation) Lewis et al.		2020	To Read
	Why it matters How most production LLM apps work today
38	LLaVA (Visual Instruction Tuning) Liu et al.		2023	To Read
	Why it matters Multimodal LLMs — simple and open
39	Mamba (State Space Models) Gu & Dao		2023	To Read
	Why it matters What if transformers aren't the final form?
40	DeepSeek-R1 DeepSeek-AI		2025	To Read
	Why it matters Cheap frontier models + reasoning via RL
41	Segment Anything (SAM) Kirillov et al.		2023	To Read
	Why it matters Foundation model for vision — capstone for your vision interest
42	Sentence-BERT			To Read
43	Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation			To Read
44	Generative Modeling by Estimating Gradients of the Data Distribution			To Read
45	Score-Based Generative Modeling through Stochastic Differential Equations			To Read
46	Bayesian Learning via Stochastic Gradient Langevin Dynamics			To Read
47	RoBERTa: A Robustly Optimized BERT Pretraining Approach			To Read
48	Flow Matching for Generative Modeling Lipman et al.			To Read
49	An Introduction to Flow Matching and Diffusion Models Peter Holderrieth, Ezra Erives			To Read
50	Training Compute-Optimal Large Language Models			To Read
	Why it matters Chinchilla law
51	DeepSeek LLM: Scaling Open-Source Language Models with Longtermism DeepSeek-AI et al.			To Read
52	BloombergGPT: A Large Language Model for Finance			To Read
53	"Why Should I Trust You?": Explaining the Predictions of Any Classifier Tulio Ribeiro et al.	xAI	2016	In Progress
	Why it matters LIME Summary 2) ξ(x) = argmin_{g ∈ G} L(f, g, π_x) + Ω(g) 3) π_x(z) = exp(−D(x, z)² / σ²) 4) L(f, g, π_x) = ∑_{z, z′ ∈ Z} π_x(z) (f(z) − g(z′))²
54	A Unified Approach to Interpreting Model Predictions Scott Lundberg, Su-In Lee	xAI		In Progress
	Why it matters SHAP Summary ϕ_i(f, x) = ∑_{z′ ⊆ x′} [ |z′|! (M − |z′| − 1)! / M! ] · [ f_x(z′) − f_x(z′ ∖ i) ] ↓ For each subset z′ of features : 1) Calculate the weight |z′|! (M − |z′| − 1)! / M! 2) Calculate the marginal contribution [f_x(z′) − f_x(z′ ∖ i)] 3) Sum it all up ↓ Gives the fair contribution of feature i to the prediction
55	SmoothGrad: removing noise by adding noise	xAI		To Read
56	TabTransformer: Tabular Data Modeling Using Contextual Embeddings			To Read
57	FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention			To Read
58	SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training			To Read
59	TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models			To Read
60	Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data			To Read
61	TabNet: Attentive Interpretable Tabular Learning			To Read
62	MLP-Mixer: An all-MLP Architecture for Vision			To Read
63	DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks			To Read
64	Neural Additive Models: Interpretable Machine Learning with Neural Nets			To Read
65	NODE-GAM: Neural Generalized Additive Model for Interpretable Deep Learning			To Read
66	TabDDPM: Modelling Tabular Data with Diffusion Models			To Read
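
Note on entry 34 (FlashAttention): the block-merge rule for the softmax statistics can be checked numerically. The sketch below is a minimal pure-Python illustration, not the paper's GPU kernel. The function names (`block_stats`, `merge_stats`, `blockwise_softmax`) and the sample scores are assumptions made up for this example, and it covers only the softmax normalizer, not the running rescaling of the attention output that the real algorithm also performs.

```python
import math


def block_stats(xs):
    """Return (m, l) for one block: m = max(xs), l = sum(exp(x - m))."""
    m = max(xs)
    l = sum(math.exp(x - m) for x in xs)
    return m, l


def merge_stats(m1, l1, m2, l2):
    """Merge two blocks' statistics: rescale each sum to the shared max."""
    m = max(m1, m2)
    l = math.exp(m1 - m) * l1 + math.exp(m2 - m) * l2
    return m, l


def softmax(xs):
    """Reference softmax computed over the whole vector at once."""
    m, l = block_stats(xs)
    return [math.exp(x - m) / l for x in xs]


def blockwise_softmax(xs, block_size):
    """Softmax whose normalizer is built one block at a time."""
    m, l = block_stats(xs[:block_size])
    for start in range(block_size, len(xs), block_size):
        m2, l2 = block_stats(xs[start:start + block_size])
        m, l = merge_stats(m, l, m2, l2)
    return [math.exp(x - m) / l for x in xs]


# Hypothetical attention scores for one query row.
scores = [0.5, 2.0, -1.0, 3.5, 1.25, 0.0]
reference = softmax(scores)
blocked = blockwise_softmax(scores, block_size=2)
max_abs_diff = max(abs(a - b) for a, b in zip(reference, blocked))
print(max_abs_diff)
```

The printed difference should be at floating-point rounding level, which is what the identity e^{x_i − m} = e^{m_k − m} · e^{x_i − m_k} predicts: each block's partial sum only needs one rescaling factor when the running maximum changes.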