Complete Learning Guide · Data Science

Principal Component Analysis

A complete, in-depth guide to understanding PCA from first principles — covering the math, the intuition, Python implementation, and real applications.

Dimensionality Reduction · Unsupervised Learning · Linear Algebra · Feature Engineering · scikit-learn
Foundations · 01

What is PCA?

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a dataset with many correlated features into a smaller set of uncorrelated variables called principal components.

Each principal component is a linear combination of the original features, ordered by the amount of variance it explains — PC1 always explains the most variance, PC2 the second most, and so on.

Core Idea
PCA finds the directions (axes) in your feature space along which the data varies the most, then projects your data onto those directions. The result is a new, compressed representation that keeps as much information as possible.
1901 · Invented by Karl Pearson
Linear · Transformation type
SVD · Computed via Singular Value Decomposition
Foundations · 02

Why use PCA?

Modern datasets often have hundreds or thousands of features. This creates several problems that PCA is specifically designed to solve.

Curse of Dimensionality

As features increase, the data becomes increasingly sparse. ML models need exponentially more data to learn patterns, and they tend to overfit.

🔗 Multicollinearity

Correlated features carry redundant information. For example, "height in cm" and "height in inches" say the same thing. PCA merges them into one component.

👁 Visualization

Humans can only visualize 2D or 3D. PCA reduces any dataset to 2–3 components so you can actually see the structure — clusters, outliers, patterns.

🚀 Speed

Training a model on 50 features vs 500 features can be 10–100× faster. PCA removes the noise and redundancy before training.

Foundations · 03

The Intuition

📸 The Photography Analogy
When you take a photo of a 3D object, you project it onto a 2D surface (the photo). You lose one dimension, but if you take the photo from the right angle, you capture the most important visual information. PCA works exactly like this — it finds the "best angle" to project your data, preserving as much variance (information) as possible.
🔦 The Shadow Analogy
Shine a flashlight on a 3D object and observe the shadow. The shadow is 2D but its shape depends on the angle of light. PCA finds the angle that casts the largest, most informative shadow — the projection that preserves the most spread (variance) in your data.
📐 Rotating the Axes
PCA is essentially a rotation of your coordinate system. Instead of the original X, Y, Z axes, PCA creates new axes (PC1, PC2, PC3...) that are aligned with the directions of maximum variance in your data. The first new axis (PC1) points in the direction where the data spreads out the most.

Simple Example: Height & Weight

Suppose you have data on people's height and weight. These two features are correlated — taller people tend to weigh more.

What PCA does
PC1 (main direction) — a diagonal axis from "short & light" to "tall & heavy". This single axis captures most of the body size information from both height and weight.

PC2 (perpendicular) — captures the remaining variation: people who deviate from the height-weight trend (muscular, etc.). Much less important.

Result: You compress 2 features into 1 (PC1) while keeping ~95% of the information. You traded a little accuracy for a huge simplification.
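This two-feature example is easy to reproduce. Below is a minimal sketch with synthetic height/weight data (all numbers invented for illustration); PC1 typically ends up carrying roughly 90% or more of the variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic data: weight follows a noisy linear trend in height
rng = np.random.default_rng(42)
height = rng.normal(170, 10, size=200)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, size=200)   # kg

X_scaled = StandardScaler().fit_transform(np.column_stack([height, weight]))

pca = PCA(n_components=2).fit(X_scaled)
# PC1 captures the shared "body size" axis; PC2 the deviation from the trend
print(pca.explained_variance_ratio_.round(3))
```

The exact ratio depends on how tightly the two features are correlated: for perfectly correlated features, PC1 would capture 100% of the variance.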

Mathematics · Step 1

Standardize the Data

Before applying PCA, we must standardize each feature so that it has mean = 0 and standard deviation = 1. This is called z-score normalization.

z = (x − μ) / σ

where μ is the feature mean and σ is the standard deviation.

⚠️ Why this matters
Without standardization, features with large scales dominate the analysis. For example, if you have "salary in USD" (range: 30,000–200,000) and "age" (range: 20–65), PCA will treat salary as ~3,000× more important than age just because of units. Standardization puts all features on equal footing.
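The z-score formula is straightforward to verify by hand. A small sketch with made-up salary and age values (note that StandardScaler uses the population standard deviation, i.e. ddof=0):

```python
import numpy as np

salary = np.array([30_000, 60_000, 120_000, 200_000], dtype=float)
age    = np.array([22, 35, 48, 65], dtype=float)

def zscore(x):
    # z = (x − μ) / σ, using the population std (ddof=0) like StandardScaler
    return (x - x.mean()) / x.std()

for name, col in [("salary", zscore(salary)), ("age", zscore(age))]:
    # After standardization each feature has mean 0 and std 1
    print(name, col.round(2), "mean:", round(col.mean(), 10), "std:", round(col.std(), 2))
```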

Example: Student Data

| Student | Math (raw) | Science (raw) | Hours (raw) | Math (std) | Science (std) | Hours (std) |
|---------|------------|---------------|-------------|------------|---------------|-------------|
| Alice   | 85         | 82            | 8           | +0.30      | +0.22         | +0.41       |
| Bob     | 60         | 65            | 5           | −1.39      | −1.24         | −1.23       |
| Carol   | 92         | 90            | 10          | +1.39      | +1.35         | +1.64       |
| Dave    | 70         | 68            | 6           | −0.69      | −0.79         | −0.41       |
| Eve     | 78         | 75            | 7           | 0.00       | 0.00          | 0.00        |

After standardization, all three features have mean=0 and std=1. Now PCA treats them equally.

from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

data = {
    'math':    [85, 60, 92, 70, 78],
    'science': [82, 65, 90, 68, 75],
    'hours':   [8,  5,  10, 6,  7 ]
}

scaler = StandardScaler()
X_scaled = scaler.fit_transform(pd.DataFrame(data))

# X_scaled[:,0].mean() ≈ 0.0   X_scaled[:,0].std() ≈ 1.0
print(f"Mean: {X_scaled.mean(axis=0).round(10)}")   # [0. 0. 0.]
print(f"Std:  {X_scaled.std(axis=0).round(2)}")     # [1. 1. 1.]
Mathematics · Step 2

Compute the Covariance Matrix

The covariance matrix tells us how much each pair of features varies together. It is an n×n symmetric matrix where n is the number of features.

Cov(X,Y) = Σ [(xᵢ − x̄)(yᵢ − ȳ)] / (n−1)

The diagonal contains the variance of each feature. Off-diagonal elements show how two features move together:

Covariance Matrix — Student Example (3×3)


|         | Math | Science | Hours |
|---------|------|---------|-------|
| Math    | 1.00 | 0.97    | 0.98  |
| Science | 0.97 | 1.00    | 0.96  |
| Hours   | 0.98 | 0.96    | 1.00  |
Interpretation
All off-diagonal values fall between 0.96 and 0.98. This means Math, Science, and Study Hours are almost perfectly correlated — they all measure the same underlying thing: academic effort and ability. PCA will compress all three into a single component.
X_cov = np.cov(X_scaled.T, ddof=0)  # transpose: features as rows; ddof=0 matches StandardScaler's population variance

print(X_cov.round(2))
# [[1.   0.97 0.98]
#  [0.97 1.   0.96]
#  [0.98 0.96 1.  ]]
Mathematics · Step 3

Eigenvalues & Eigenvectors

This is the mathematical core of PCA. We decompose the covariance matrix to find its eigenvalues and eigenvectors.

Eigenvectors

The New Axes (Directions)

Eigenvectors define the directions of the new feature space — the principal components. They are always perpendicular (orthogonal) to each other.

Each eigenvector is a unit vector (length = 1) pointing in the direction of maximum variance, given that previous directions have already been accounted for.

Eigenvalues

The Importance (Magnitude)

Eigenvalues tell us how much variance each eigenvector (principal component) captures. Larger eigenvalue = more important component.

The sum of all eigenvalues = total variance in the dataset. So each eigenvalue / total = percentage of variance explained.

A · v = λ · v

where A = covariance matrix, v = eigenvector, λ = eigenvalue
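The eigen-equation can be checked numerically on the student dataset from the table above; a quick sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Student data from the earlier table: math, science, hours
data = np.array([[85, 82, 8],
                 [60, 65, 5],
                 [92, 90, 10],
                 [70, 68, 6],
                 [78, 75, 7]], dtype=float)

X_scaled = StandardScaler().fit_transform(data)
A = np.cov(X_scaled.T, ddof=0)          # covariance matrix of standardized data

eigenvalues, eigenvectors = np.linalg.eigh(A)
for lam, v in zip(eigenvalues, eigenvectors.T):
    # A · v must equal λ · v for every eigenpair
    assert np.allclose(A @ v, lam * v)
print("All eigenpairs satisfy A·v = λ·v")
```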

Eigenvalues for Student Dataset

| Component | Eigenvalue (λ) | Cumulative variance |
|-----------|----------------|---------------------|
| PC1       | 2.94           | 98.0%               |
| PC2       | 0.05           | 99.6%               |
| PC3       | 0.01           | 100%                |
Key Insight
PC1 alone captures 98% of all variance! This means we can reduce 3 features to just 1 component and keep almost all information. The reason: Math, Science, and Hours are so correlated they basically measure one thing.
eigenvalues, eigenvectors = np.linalg.eigh(X_cov)

# Sort by eigenvalue descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues  = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

variance_explained = eigenvalues / eigenvalues.sum()
cumulative         = np.cumsum(variance_explained)

print("Eigenvalues:",         eigenvalues.round(3))  # [2.94  0.05  0.01]
print("Variance explained:",  variance_explained.round(3))
print("Cumulative variance:", cumulative.round(3))
Mathematics · Step 4

Selecting k Components

After computing the eigenvalues, we choose how many principal components (k) to keep. This is one of the most important decisions in PCA.

Three Rules for Choosing k

1

Variance Threshold Rule (most common)

Keep enough components to explain at least 85–95% of cumulative variance. The threshold depends on your use case — visualization needs 2–3 components; models can tolerate 85%.

(λ₁ + … + λₖ) / (λ₁ + … + λₙ) ≥ 0.95 → choose the smallest k where cumulative variance ≥ 95%
2

Kaiser Rule (eigenvalue ≥ 1)

Keep components with eigenvalue ≥ 1.0. The logic: if a component explains less variance than a single original feature (variance = 1 after standardization), it's not worth keeping.

3

Scree Plot (visual elbow method)

Plot the eigenvalues in descending order and look for the "elbow" — the point where they stop dropping sharply. Keep components before the elbow.

[Scree plot: eigenvalues per component, dropping steeply after PC1]

Clear elbow after PC1 — keep only 1 component for this dataset.
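Both the Kaiser rule and the variance-threshold rule reduce to a couple of lines of NumPy. A sketch using illustrative eigenvalues (assumed values, not the student dataset):

```python
import numpy as np

eigenvalues = np.array([3.6, 1.2, 0.4, 0.1, 0.08])  # illustrative values

# Kaiser rule: keep every component with eigenvalue >= 1.0
k_kaiser = int((eigenvalues >= 1.0).sum())

# Variance-threshold rule: smallest k whose cumulative share reaches 95%
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
k_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(f"Kaiser rule: k = {k_kaiser}")   # k = 2
print(f"95% rule:    k = {k_95}")       # k = 3
```

Note that the two rules can disagree, as they do here; the variance threshold keeps a third component that the Kaiser rule discards.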

from sklearn.decomposition import PCA

# Option A: specify exact number of components
pca = PCA(n_components=2)

# Option B: specify variance threshold (auto-selects k)
pca = PCA(n_components=0.95)  # keep 95% of variance

pca.fit(X_scaled)
print(f"Components kept: {pca.n_components_}")
print(pca.explained_variance_ratio_)  # per-component %
print(pca.explained_variance_ratio_.cumsum())  # cumulative %
Mathematics · Step 5

Transform the Data

Once we've selected k components, we project the original data onto the new principal component axes. This is a simple matrix multiplication.

X_new = X_scaled · W

where W is the matrix of k eigenvectors (shape: n_features × k)

The result is a new dataset with k columns instead of the original n columns, where each column is uncorrelated with all others.
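As a sanity check, this matrix product can be computed by hand and compared with scikit-learn's transform; a sketch on the student data from earlier (the two agree because the scaled data is already centered):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = np.array([[85, 82, 8], [60, 65, 5], [92, 90, 10],
                 [70, 68, 6], [78, 75, 7]], dtype=float)
X_scaled = StandardScaler().fit_transform(data)

pca = PCA(n_components=1).fit(X_scaled)
W = pca.components_.T                    # shape (3, 1): eigenvector as a column

X_manual  = X_scaled @ W                 # the projection X_new = X_scaled · W
X_sklearn = pca.transform(X_scaled)      # sklearn's version of the same step

print(np.allclose(X_manual, X_sklearn))  # True
```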

Transformed Student Data (k=1)

| Student | Original Data (3 features) | PC1 Score (1 feature) | Interpretation     |
|---------|----------------------------|-----------------------|--------------------|
| Alice   | 85, 82, 8                  | +0.54                 | Above average      |
| Bob     | 60, 65, 5                  | −2.45                 | Well below average |
| Carol   | 92, 90, 10                 | +2.46                 | Top performer      |
| Dave    | 70, 68, 6                  | −1.22                 | Below average      |
| Eve     | 78, 75, 7                  | 0.00                  | Exactly average    |
Result
We compressed 3 features into 1 number (PC1) that represents overall academic performance, retaining 98% of the original information. Carol is clearly the top performer, Bob is the weakest — immediately visible from one number.

To reconstruct the original data (approximately), use inverse_transform:

pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_scaled)        # compress: (5,3) → (5,1)
X_back = pca.inverse_transform(X_pca)     # reconstruct: (5,1) → (5,3)

# X_back ≈ X_scaled (not exact — we lost the 2% in PC2+PC3)
reconstruction_error = np.mean((X_scaled - X_back)**2)
print(f"Reconstruction MSE: {reconstruction_error:.4f}")
In Practice · Code

Complete Python Implementation

Here is a complete, production-ready PCA implementation using scikit-learn with all best practices.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# ─── 1. Load and prepare data ───────────────────────────────
df = pd.read_csv('dataset.csv')
X = df.drop('target', axis=1)

# ─── 2. Build PCA pipeline ──────────────────────────────────
pipeline = Pipeline([
    ('scaler', StandardScaler()),        # always standardize first!
    ('pca',    PCA(n_components=0.95))  # keep 95% variance
])

X_pca = pipeline.fit_transform(X)
pca   = pipeline.named_steps['pca']

# ─── 3. Examine results ─────────────────────────────────────
k = pca.n_components_
print(f"Original features:  {X.shape[1]}")
print(f"PCA components kept: {k}")
print(f"Variance retained:  {pca.explained_variance_ratio_.sum():.1%}")

# ─── 4. Scree plot ───────────────────────────────────────────
plt.figure(figsize=(8, 4))
plt.bar(range(1, k+1), pca.explained_variance_ratio_, color='#4f8ef7')
plt.plot(range(1, k+1), pca.explained_variance_ratio_.cumsum(), 'r--')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot')
plt.show()

# ─── 5. Use in ML model ─────────────────────────────────────
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA(n_components=0.95)),
    ('clf',    LogisticRegression())
])

scores = cross_val_score(full_pipeline, X, df['target'], cv=5)
print(f"CV Accuracy: {scores.mean():.1%} ± {scores.std():.1%}")
In Practice · Decision

How to Choose k (Explained Variance)

The right value of k depends on your goals. Here is a practical decision guide:

| Use Case               | Recommended k      | Variance to keep          | Reason                                 |
|------------------------|--------------------|---------------------------|----------------------------------------|
| 2D visualization       | k = 2              | whatever 2 components give | Need exactly 2 for a scatter plot      |
| 3D visualization       | k = 3              | whatever 3 components give | Need exactly 3 for a 3D plot           |
| ML preprocessing       | auto (95% rule)    | ≥ 95%                     | Balance: info retention vs speed       |
| Noise removal          | auto (85% rule)    | ≥ 85%                     | Aggressively remove low-variance noise |
| Compression            | depends on quality | 80–90%                    | Trade quality for size/speed           |
| Feature interpretation | small (3–5)        | as much as possible       | Human-interpretable components         |
Important Warning
Using PCA with n_components=0.99 is NOT always better than 0.95. The last 4% of variance often contains mostly noise. Keeping noise hurts model performance. Always validate on a test set!
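One way to run that validation is to treat n_components as a hyperparameter and let cross-validation choose it. A sketch on synthetic data (the dataset and the candidate thresholds are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification problem: 20 features, only 5 informative
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA()),
    ('clf',    LogisticRegression(max_iter=1000)),
])

# Let CV pick the variance threshold instead of assuming 0.99 is best
grid = GridSearchCV(pipe, {'pca__n_components': [0.85, 0.95, 0.99]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```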
In Practice · Interpretation

Understanding Loadings

PCA components are linear combinations of original features. The coefficients are called loadings (or component weights). They tell you which original features each principal component is "made of".

PC1 = w₁·feature₁ + w₂·feature₂ + ... + wₙ·featureₙ

Example: Student Loadings

| Feature       | PC1 Loading | PC2 Loading | Interpretation                                        |
|---------------|-------------|-------------|-------------------------------------------------------|
| Math Score    | +0.578      | −0.401      | Math drives PC1 strongly                              |
| Science Score | +0.575      | −0.448      | Science drives PC1 similarly                          |
| Study Hours   | +0.579      | +0.800      | Contributes equally to PC1, but uniquely to PC2       |
Reading the Loadings
PC1: All loadings ≈ +0.578. This means PC1 is the "average" of all three subjects — it captures overall academic performance.

PC2: Study hours loads very differently (+0.80) vs Math/Science (-0.40). PC2 represents students who study a lot but score below expectations — an "effort vs outcome" dimension.
pca = PCA(n_components=2)
pca.fit(X_scaled)

# pca.components_ shape: (n_components, n_features)
loadings = pd.DataFrame(
    pca.components_.T,
    index   = ['math', 'science', 'hours'],
    columns = ['PC1', 'PC2']
)

print(loadings.round(3))
#          PC1    PC2
# math     0.578 -0.401
# science  0.575 -0.448
# hours    0.579  0.800
In Practice · Evaluation

Advantages & Disadvantages

ADVANTAGES
Reduces overfitting by removing noisy features
Dramatically speeds up model training
Removes multicollinearity between features
Enables visualization of high-dimensional data
Can improve model accuracy by removing noise
Unsupervised — no labels needed
Well-studied with strong theoretical guarantees
DISADVANTAGES
Components are hard to interpret (not original features)
Always loses some information (information loss)
Assumes linear relationships only
Sensitive to outliers (use robust PCA if needed)
Requires standardization before use
Choosing k is subjective
Feature importance is lost after transformation
Applications · 01

When to Use PCA

Features > 50

High-dimensional data

When you have many features and suspect many are correlated. Common in genomics, image processing, text analysis.

High correlation

Correlated features

When a correlation matrix shows many values above 0.7–0.8, PCA will compress those into fewer dimensions effectively.

Visualization

Exploring data patterns

When you want to see if natural clusters exist. Reduce to 2D and plot — you'll see groupings immediately.

Overfitting

Model regularization

When your model overfits and adding more data isn't an option. PCA reduces the input space so the model generalizes better.

When NOT to Use PCA

Interpretability

Need to know which features matter

If you need to report "feature X is important", PCA destroys that. Use LASSO regression or feature importance from tree models instead.

Few features

Small feature set (<10)

PCA adds complexity without much benefit. If you only have 5–10 features, just use them directly.

Non-linear

Non-linear relationships

PCA only finds linear patterns. Use Kernel PCA, t-SNE, or UMAP if the important structures in your data are non-linear.

Classification

Want class-aware reduction

PCA ignores class labels. Use Linear Discriminant Analysis (LDA) instead — it specifically maximizes class separation.

Applications · 02

PCA Variants & Alternatives

Standard PCA has limitations. Here are specialized variants for different situations.

| Method            | When to use                            | Key difference from PCA                                              |
|-------------------|----------------------------------------|----------------------------------------------------------------------|
| Kernel PCA        | Non-linear data structure              | Uses the kernel trick to find non-linear components                  |
| Sparse PCA        | Need interpretable components          | Forces most loadings to be zero (sparse)                             |
| Incremental PCA   | Huge datasets that don't fit in RAM    | Processes data in mini-batches                                       |
| Randomized PCA    | Very high-dimensional, speed critical  | Approximate SVD — much faster, slight accuracy loss                  |
| Robust PCA        | Data with many outliers                | Separates low-rank + sparse components                               |
| TruncatedSVD / LSA | Sparse text/NLP data                  | Works directly on sparse matrices (no centering)                     |
| LDA               | Classification with class labels       | Maximizes class separability, not variance                           |
| t-SNE             | 2D/3D visualization of clusters        | Non-linear, preserves local structure; not for ML features           |
| UMAP              | Large datasets, visualization, features | Non-linear, preserves global + local structure; faster than t-SNE   |
| Autoencoder       | Complex non-linear compression         | Deep learning version — learns complex representations               |
Quick Decision Guide
Linear + unsupervised + fast? → Standard PCA
Non-linear structure? → Kernel PCA or UMAP
Have class labels? → LDA
Visualization only? → t-SNE or UMAP
Text / NLP? → TruncatedSVD (LSA)
Outliers in data? → Robust PCA
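To make the first two branches of the guide concrete, here is a sketch contrasting plain PCA with Kernel PCA on concentric circles, a classic non-linear structure that a pure rotation cannot separate (parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: the class boundary is a circle, not a line
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Plain PCA is a rotation, so the rings stay nested after projection
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel can unfold the rings into separable groups
X_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X)

print(X_pca.shape, X_kpca.shape)  # (300, 2) (300, 2)
```

Plotting the first component of each transform against the class labels makes the difference visible: the linear projection mixes the rings, while the kernel projection tends to pull them apart.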
Reference · Glossary

Key Terms

Principal Component (PC)
A new axis/feature created by PCA. Each PC is a linear combination of all original features, ordered by variance explained.
Eigenvalue (λ)
A scalar that tells how much variance a principal component captures. Sum of all eigenvalues = total variance in the data.
Eigenvector
A vector that defines the direction of a principal component. Always orthogonal (perpendicular) to other eigenvectors.
Covariance Matrix
An n×n symmetric matrix showing how much each pair of features varies together. PCA computes eigenvectors of this matrix.
Explained Variance Ratio
The fraction of total variance captured by each component: λᵢ / Σλ. Used to decide how many components to keep.
Loadings
The coefficients (weights) that tell how much each original feature contributes to each principal component. Stored in pca.components_.
Scree Plot
A bar/line chart of eigenvalues in descending order. The "elbow" point suggests the optimal number of components k.
Dimensionality Reduction
The process of reducing the number of features while retaining as much information as possible.
Curse of Dimensionality
As features increase, data becomes exponentially sparse, making ML harder. PCA directly addresses this problem.
Multicollinearity
When two or more features are highly correlated. They carry redundant information that PCA compresses into fewer dimensions.
Standardization (Z-score)
Transform features to mean=0, std=1. Required before PCA so large-scale features don't dominate small-scale ones.
SVD (Singular Value Decomposition)
The linear algebra algorithm used to efficiently compute PCA. sklearn's PCA uses SVD internally.
Reconstruction Error
How much information was lost after PCA compression. Measured as MSE between original and reconstructed data.
Biplot
A visualization that shows both PCA scores (data points) and loadings (arrows for each original feature) in the same plot.
Self-Test · Quiz

Test Your Understanding

Each question lists its answer choices; the explanation that follows identifies the correct one.

Q1. What does PCA try to maximize when creating principal components?
Mean of the data
Variance in the projected direction
Correlation between features
Number of data points
PCA finds the direction (eigenvector) along which the data spreads out the most — i.e., maximizing variance. This ensures the projection retains as much information as possible.
Q2. Why must we standardize data BEFORE applying PCA?
To make the data normally distributed
To reduce the number of features
So that features with large scales don't dominate the analysis
To remove outliers from the data
Without standardization, a feature with values 0–100,000 would dominate a feature with values 0–1, simply because of units. Standardization (mean=0, std=1) puts all features on equal footing.
Q3. You apply PCA and get eigenvalues: [3.6, 1.2, 0.4, 0.1, 0.08]. How many components should you keep using the Kaiser rule?
1 component
2 components
3 components
5 components (all)
The Kaiser rule says keep components where eigenvalue ≥ 1.0. Here, only PC1 (3.6) and PC2 (1.2) qualify. PC3 (0.4) is below 1.0, so we stop at k=2.
Q4. After PCA, principal components are always:
Identical to the original features
Highly correlated with each other
Uncorrelated (orthogonal) with each other
Always fewer than 3 dimensions
A key property of PCA is that all principal components are orthogonal (perpendicular) to each other — they are completely uncorrelated. This is by mathematical design (eigenvectors of a symmetric matrix are orthogonal).
Q5. When would you use LDA instead of PCA?
When you have too many features
When you have class labels and want to maximize class separation
When the data has outliers
When the data is non-linear
LDA (Linear Discriminant Analysis) uses class labels to find directions that maximize the distance between classes. PCA ignores labels — it maximizes variance without regard to which class data belongs to. Use LDA for classification preprocessing, PCA for unsupervised exploration.
Q6. A PCA component has these loadings: [+0.9, +0.85, +0.88]. What does this tell you?
The features are independent of each other
All three features contribute positively and roughly equally — this PC captures their shared variation
Only the first feature matters
The data needs more standardization
When all loadings are large and positive (≈ same magnitude), the principal component represents the "average" or "overall level" of all three features together. This typically means the features are all measuring the same underlying construct.
PCA Complete Guide
Data Science Series · MSDE Program