Complete Learning Guide · Data Science

Principal Component Analysis

A complete, in-depth guide to understanding PCA from first principles — covering the math, the intuition, Python implementation, and real applications.

Dimensionality Reduction · Unsupervised Learning · Linear Algebra · Feature Engineering · scikit-learn
Foundations · 01

What is PCA?

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a dataset with many correlated features into a smaller set of uncorrelated variables called principal components.

Each principal component is a linear combination of the original features, ordered by the amount of variance it explains — PC1 always explains the most variance, PC2 the second most, and so on.

Core Idea
PCA finds the directions (axes) in your feature space along which the data varies the most, then projects your data onto those directions. The result is a new, compressed representation that keeps as much information as possible.
1901 · Invented by Karl Pearson
Linear · Transformation type
SVD · Computed via Singular Value Decomposition
Foundations · 02

Why use PCA?

Modern datasets often have hundreds or thousands of features. This creates several problems that PCA is specifically designed to solve.

Curse of Dimensionality

As features increase, the data becomes increasingly sparse. ML models need exponentially more data to learn patterns, and they tend to overfit.

🔗 Multicollinearity

Correlated features carry redundant information. For example, "height in cm" and "height in inches" say the same thing. PCA merges them into one component.

👁 Visualization

Humans can only visualize 2D or 3D. PCA reduces any dataset to 2–3 components so you can actually see the structure — clusters, outliers, patterns.

🚀 Speed

Training a model on 50 features vs 500 features can be 10–100× faster. PCA removes the noise and redundancy before training.

Foundations · 03

The Intuition

📸 The Photography Analogy
When you take a photo of a 3D object, you project it onto a 2D surface (the photo). You lose one dimension, but if you take the photo from the right angle, you capture the most important visual information. PCA works exactly like this — it finds the "best angle" to project your data, preserving as much variance (information) as possible.
🔦 The Shadow Analogy
Shine a flashlight on a 3D object and observe the shadow. The shadow is 2D but its shape depends on the angle of light. PCA finds the angle that casts the largest, most informative shadow — the projection that preserves the most spread (variance) in your data.
📐 Rotating the Axes
PCA is essentially a rotation of your coordinate system. Instead of the original X, Y, Z axes, PCA creates new axes (PC1, PC2, PC3...) that are aligned with the directions of maximum variance in your data. The first new axis (PC1) points in the direction where the data spreads out the most.

Simple Example: Height & Weight

Suppose you have data on people's height and weight. These two features are correlated — taller people tend to weigh more.

What PCA does
PC1 (main direction) — a diagonal axis from "short & light" to "tall & heavy". This single axis captures most of the body size information from both height and weight.

PC2 (perpendicular) — captures the remaining variation: people who deviate from the height-weight trend (muscular, etc.). Much less important.

Result: You compress 2 features into 1 (PC1) while keeping ~95% of the information. You traded a little accuracy for a huge simplification.
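This two-feature example is easy to reproduce. Below is a minimal sketch with synthetic height/weight data (all numbers invented for illustration); PC1 typically ends up carrying roughly 90% or more of the variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic data: weight follows a noisy linear trend in height
rng = np.random.default_rng(42)
height = rng.normal(170, 10, size=200)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, size=200)   # kg

X_scaled = StandardScaler().fit_transform(np.column_stack([height, weight]))

pca = PCA(n_components=2).fit(X_scaled)
# PC1 captures the shared "body size" axis; PC2 the deviation from the trend
print(pca.explained_variance_ratio_.round(3))
```

The exact ratio depends on how tightly the two features are correlated: for perfectly correlated features, PC1 would capture 100% of the variance.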

Mathematics · Step 1

Standardize the Data

Before applying PCA, we must standardize each feature so that it has mean = 0 and standard deviation = 1. This is called z-score normalization.

z = (x − μ) / σ

where μ is the feature mean and σ is the standard deviation.

⚠️ Why this matters
Without standardization, features with large scales dominate the analysis. For example, if you have "salary in USD" (range: 30,000–200,000) and "age" (range: 20–65), PCA will treat salary as ~3,000× more important than age just because of units. Standardization puts all features on equal footing.
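The z-score formula is straightforward to verify by hand. A small sketch with made-up salary and age values (note that StandardScaler uses the population standard deviation, i.e. ddof=0):

```python
import numpy as np

salary = np.array([30_000, 60_000, 120_000, 200_000], dtype=float)
age    = np.array([22, 35, 48, 65], dtype=float)

def zscore(x):
    # z = (x − μ) / σ, using the population std (ddof=0) like StandardScaler
    return (x - x.mean()) / x.std()

for name, col in [("salary", zscore(salary)), ("age", zscore(age))]:
    # After standardization each feature has mean 0 and std 1
    print(name, col.round(2), "mean:", round(col.mean(), 10), "std:", round(col.std(), 2))
```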

Example: Student Data

| Student | Math (raw) | Science (raw) | Hours (raw) | Math (std) | Science (std) | Hours (std) |
|---------|------------|---------------|-------------|------------|---------------|-------------|
| Alice   | 85         | 82            | 8           | +0.30      | +0.22         | +0.41       |
| Bob     | 60         | 65            | 5           | −1.39      | −1.24         | −1.23       |
| Carol   | 92         | 90            | 10          | +1.39      | +1.35         | +1.64       |
| Dave    | 70         | 68            | 6           | −0.69      | −0.79         | −0.41       |
| Eve     | 78         | 75            | 7           | 0.00       | 0.00          | 0.00        |

After standardization, all three features have mean=0 and std=1. Now PCA treats them equally.

from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

data = {
    'math':    [85, 60, 92, 70, 78],
    'science': [82, 65, 90, 68, 75],
    'hours':   [8,  5,  10, 6,  7 ]
}

scaler = StandardScaler()
X_scaled = scaler.fit_transform(pd.DataFrame(data))

# X_scaled[:,0].mean() ≈ 0.0   X_scaled[:,0].std() ≈ 1.0
print(f"Mean: {X_scaled.mean(axis=0).round(10)}")   # [0. 0. 0.]
print(f"Std:  {X_scaled.std(axis=0).round(2)}")     # [1. 1. 1.]
Mathematics · Step 2

Compute the Covariance Matrix

The covariance matrix tells us how much each pair of features varies together. It is an n×n symmetric matrix where n is the number of features.

Cov(X,Y) = Σ [(xᵢ − x̄)(yᵢ − ȳ)] / (n−1)

The diagonal contains the variance of each feature. Off-diagonal elements show how two features move together:

Covariance Matrix — Student Example (3×3)


|         | Math | Science | Hours |
|---------|------|---------|-------|
| Math    | 1.00 | 0.97    | 0.98  |
| Science | 0.97 | 1.00    | 0.96  |
| Hours   | 0.98 | 0.96    | 1.00  |
Interpretation
All off-diagonal values fall between 0.96 and 0.98. This means Math, Science, and Study Hours are almost perfectly correlated — they all measure the same underlying thing: academic effort and ability. PCA will compress all three into a single component.
X_cov = np.cov(X_scaled.T, ddof=0)  # transpose: features as rows; ddof=0 matches StandardScaler's population variance

print(X_cov.round(2))
# [[1.   0.97 0.98]
#  [0.97 1.   0.96]
#  [0.98 0.96 1.  ]]
Mathematics · Step 3

Eigenvalues & Eigenvectors

This is the mathematical core of PCA. We decompose the covariance matrix to find its eigenvalues and eigenvectors.

Eigenvectors

The New Axes (Directions)

Eigenvectors define the directions of the new feature space — the principal components. They are always perpendicular (orthogonal) to each other.

Each eigenvector is a unit vector (length = 1) pointing in the direction of maximum variance, given that previous directions have already been accounted for.

Eigenvalues

The Importance (Magnitude)

Eigenvalues tell us how much variance each eigenvector (principal component) captures. Larger eigenvalue = more important component.

The sum of all eigenvalues = total variance in the dataset. So each eigenvalue / total = percentage of variance explained.

A · v = λ · v

where A = covariance matrix, v = eigenvector, λ = eigenvalue
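The eigen-equation can be checked numerically on the student dataset from the table above; a quick sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Student data from the earlier table: math, science, hours
data = np.array([[85, 82, 8],
                 [60, 65, 5],
                 [92, 90, 10],
                 [70, 68, 6],
                 [78, 75, 7]], dtype=float)

X_scaled = StandardScaler().fit_transform(data)
A = np.cov(X_scaled.T, ddof=0)          # covariance matrix of standardized data

eigenvalues, eigenvectors = np.linalg.eigh(A)
for lam, v in zip(eigenvalues, eigenvectors.T):
    # A · v must equal λ · v for every eigenpair
    assert np.allclose(A @ v, lam * v)
print("All eigenpairs satisfy A·v = λ·v")
```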

Eigenvalues for Student Dataset

| Component | Eigenvalue (λ) | Cumulative variance |
|-----------|----------------|---------------------|
| PC1       | 2.94           | 98.0%               |
| PC2       | 0.05           | 99.6%               |
| PC3       | 0.01           | 100%                |
Key Insight
PC1 alone captures 98% of all variance! This means we can reduce 3 features to just 1 component and keep almost all information. The reason: Math, Science, and Hours are so correlated they basically measure one thing.
eigenvalues, eigenvectors = np.linalg.eigh(X_cov)

# Sort by eigenvalue descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues  = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

variance_explained = eigenvalues / eigenvalues.sum()
cumulative         = np.cumsum(variance_explained)

print("Eigenvalues:",         eigenvalues.round(3))  # [2.94  0.05  0.01]
print("Variance explained:",  variance_explained.round(3))
print("Cumulative variance:", cumulative.round(3))
Mathematics · Step 4

Selecting k Components

After computing the eigenvalues, we choose how many principal components (k) to keep. This is one of the most important decisions in PCA.

Three Rules for Choosing k

1

Variance Threshold Rule (most common)

Keep enough components to explain at least 85–95% of cumulative variance. The threshold depends on your use case — visualization needs 2–3 components; models can tolerate 85%.

(λ₁ + … + λₖ) / (λ₁ + … + λₙ) ≥ 0.95 → choose the smallest k where cumulative variance ≥ 95%
2

Kaiser Rule (eigenvalue ≥ 1)

Keep components with eigenvalue ≥ 1.0. The logic: if a component explains less variance than a single original feature (variance = 1 after standardization), it's not worth keeping.

3

Scree Plot (visual elbow method)

Plot the eigenvalues in descending order and look for the "elbow" — the point where they stop dropping sharply. Keep components before the elbow.

[Scree plot: eigenvalues per component, dropping steeply after PC1]

Clear elbow after PC1 — keep only 1 component for this dataset.
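Both the Kaiser rule and the variance-threshold rule reduce to a couple of lines of NumPy. A sketch using illustrative eigenvalues (assumed values, not the student dataset):

```python
import numpy as np

eigenvalues = np.array([3.6, 1.2, 0.4, 0.1, 0.08])  # illustrative values

# Kaiser rule: keep every component with eigenvalue >= 1.0
k_kaiser = int((eigenvalues >= 1.0).sum())

# Variance-threshold rule: smallest k whose cumulative share reaches 95%
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
k_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(f"Kaiser rule: k = {k_kaiser}")   # k = 2
print(f"95% rule:    k = {k_95}")       # k = 3
```

Note that the two rules can disagree, as they do here; the variance threshold keeps a third component that the Kaiser rule discards.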

from sklearn.decomposition import PCA

# Option A: specify exact number of components
pca = PCA(n_components=2)

# Option B: specify variance threshold (auto-selects k)
pca = PCA(n_components=0.95)  # keep 95% of variance

pca.fit(X_scaled)
print(f"Components kept: {pca.n_components_}")
print(pca.explained_variance_ratio_)  # per-component %
print(pca.explained_variance_ratio_.cumsum())  # cumulative %
Mathematics · Step 5

Transform the Data

Once we've selected k components, we project the original data onto the new principal component axes. This is a simple matrix multiplication.

X_new = X_scaled · W

where W is the matrix of k eigenvectors (shape: n_features × k)

The result is a new dataset with k columns instead of the original n columns, where each column is uncorrelated with all others.
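As a sanity check, this matrix product can be computed by hand and compared with scikit-learn's transform; a sketch on the student data from earlier (the two agree because the scaled data is already centered):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = np.array([[85, 82, 8], [60, 65, 5], [92, 90, 10],
                 [70, 68, 6], [78, 75, 7]], dtype=float)
X_scaled = StandardScaler().fit_transform(data)

pca = PCA(n_components=1).fit(X_scaled)
W = pca.components_.T                    # shape (3, 1): eigenvector as a column

X_manual  = X_scaled @ W                 # the projection X_new = X_scaled · W
X_sklearn = pca.transform(X_scaled)      # sklearn's version of the same step

print(np.allclose(X_manual, X_sklearn))  # True
```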

Transformed Student Data (k=1)

| Student | Original Data (3 features) | PC1 Score (1 feature) | Interpretation     |
|---------|----------------------------|-----------------------|--------------------|
| Alice   | 85, 82, 8                  | +0.54                 | Above average      |
| Bob     | 60, 65, 5                  | −2.45                 | Well below average |
| Carol   | 92, 90, 10                 | +2.46                 | Top performer      |
| Dave    | 70, 68, 6                  | −1.22                 | Below average      |
| Eve     | 78, 75, 7                  | 0.00                  | Exactly average    |
Result
We compressed 3 features into 1 number (PC1) that represents overall academic performance, retaining 98% of the original information. Carol is clearly the top performer, Bob is the weakest — immediately visible from one number.

To reconstruct the original data (approximately), use inverse_transform:

pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_scaled)        # compress: (5,3) → (5,1)
X_back = pca.inverse_transform(X_pca)     # reconstruct: (5,1) → (5,3)

# X_back ≈ X_scaled (not exact — we lost the 2% in PC2+PC3)
reconstruction_error = np.mean((X_scaled - X_back)**2)
print(f"Reconstruction MSE: {reconstruction_error:.4f}")
In Practice · Code

Complete Python Implementation

Here is a complete, production-ready PCA implementation using scikit-learn with all best practices.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# ─── 1. Load and prepare data ───────────────────────────────
df = pd.read_csv('dataset.csv')
X = df.drop('target', axis=1)

# ─── 2. Build PCA pipeline ──────────────────────────────────
pipeline = Pipeline([
    ('scaler', StandardScaler()),        # always standardize first!
    ('pca',    PCA(n_components=0.95))  # keep 95% variance
])

X_pca = pipeline.fit_transform(X)
pca   = pipeline.named_steps['pca']

# ─── 3. Examine results ─────────────────────────────────────
k = pca.n_components_
print(f"Original features:  {X.shape[1]}")
print(f"PCA components kept: {k}")
print(f"Variance retained:  {pca.explained_variance_ratio_.sum():.1%}")

# ─── 4. Scree plot ───────────────────────────────────────────
plt.figure(figsize=(8, 4))
plt.bar(range(1, k+1), pca.explained_variance_ratio_, color='#4f8ef7')
plt.plot(range(1, k+1), pca.explained_variance_ratio_.cumsum(), 'r--')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot')
plt.show()

# ─── 5. Use in ML model ─────────────────────────────────────
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA(n_components=0.95)),
    ('clf',    LogisticRegression())
])

scores = cross_val_score(full_pipeline, X, df['target'], cv=5)
print(f"CV Accuracy: {scores.mean():.1%} ± {scores.std():.1%}")
In Practice · Decision

How to Choose k (Explained Variance)

The right value of k depends on your goals. Here is a practical decision guide:

| Use Case               | Recommended k      | Variance to keep          | Reason                                 |
|------------------------|--------------------|---------------------------|----------------------------------------|
| 2D visualization       | k = 2              | whatever 2 components give | Need exactly 2 for a scatter plot      |
| 3D visualization       | k = 3              | whatever 3 components give | Need exactly 3 for a 3D plot           |
| ML preprocessing       | auto (95% rule)    | ≥ 95%                     | Balance: info retention vs speed       |
| Noise removal          | auto (85% rule)    | ≥ 85%                     | Aggressively remove low-variance noise |
| Compression            | depends on quality | 80–90%                    | Trade quality for size/speed           |
| Feature interpretation | small (3–5)        | as much as possible       | Human-interpretable components         |
Important Warning
Using PCA with n_components=0.99 is NOT always better than 0.95. The last 4% of variance often contains mostly noise. Keeping noise hurts model performance. Always validate on a test set!
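One way to run that validation is to treat n_components as a hyperparameter and let cross-validation choose it. A sketch on synthetic data (the dataset and the candidate thresholds are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification problem: 20 features, only 5 informative
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA()),
    ('clf',    LogisticRegression(max_iter=1000)),
])

# Let CV pick the variance threshold instead of assuming 0.99 is best
grid = GridSearchCV(pipe, {'pca__n_components': [0.85, 0.95, 0.99]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```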
In Practice · Interpretation

Understanding Loadings

PCA components are linear combinations of original features. The coefficients are called loadings (or component weights). They tell you which original features each principal component is "made of".

PC1 = w₁·feature₁ + w₂·feature₂ + ... + wₙ·featureₙ

Example: Student Loadings

| Feature       | PC1 Loading | PC2 Loading | Interpretation                                        |
|---------------|-------------|-------------|-------------------------------------------------------|
| Math Score    | +0.578      | −0.401      | Math drives PC1 strongly                              |
| Science Score | +0.575      | −0.448      | Science drives PC1 similarly                          |
| Study Hours   | +0.579      | +0.800      | Contributes equally to PC1, but uniquely to PC2       |
Reading the Loadings
PC1: All loadings ≈ +0.578. This means PC1 is the "average" of all three subjects — it captures overall academic performance.

PC2: Study hours loads very differently (+0.80) vs Math/Science (-0.40). PC2 represents students who study a lot but score below expectations — an "effort vs outcome" dimension.
pca = PCA(n_components=2)
pca.fit(X_scaled)

# pca.components_ shape: (n_components, n_features)
loadings = pd.DataFrame(
    pca.components_.T,
    index   = ['math', 'science', 'hours'],
    columns = ['PC1', 'PC2']
)

print(loadings.round(3))
#          PC1    PC2
# math     0.578 -0.401
# science  0.575 -0.448
# hours    0.579  0.800
In Practice · Evaluation

Advantages & Disadvantages

ADVANTAGES
Reduces overfitting by removing noisy features
Dramatically speeds up model training
Removes multicollinearity between features
Enables visualization of high-dimensional data
Can improve model accuracy by removing noise
Unsupervised — no labels needed
Well-studied with strong theoretical guarantees
DISADVANTAGES
Components are hard to interpret (not original features)
Always loses some information (information loss)
Assumes linear relationships only
Sensitive to outliers (use robust PCA if needed)
Requires standardization before use
Choosing k is subjective
Feature importance is lost after transformation
Applications · 01

When to Use PCA

Features > 50

High-dimensional data

When you have many features and suspect many are correlated. Common in genomics, image processing, text analysis.

High correlation

Correlated features

When a correlation matrix shows many values above 0.7–0.8, PCA will compress those into fewer dimensions effectively.

Visualization

Exploring data patterns

When you want to see if natural clusters exist. Reduce to 2D and plot — you'll see groupings immediately.

Overfitting

Model regularization

When your model overfits and adding more data isn't an option. PCA reduces the input space so the model generalizes better.

When NOT to Use PCA

Interpretability

Need to know which features matter

If you need to report "feature X is important", PCA destroys that. Use LASSO regression or feature importance from tree models instead.

Few features

Small feature set (<10)

PCA adds complexity without much benefit. If you only have 5–10 features, just use them directly.

Non-linear

Non-linear relationships

PCA only finds linear patterns. Use Kernel PCA, t-SNE, or UMAP if the important structures in your data are non-linear.

Classification

Want class-aware reduction

PCA ignores class labels. Use Linear Discriminant Analysis (LDA) instead — it specifically maximizes class separation.

Applications · 02

PCA Variants & Alternatives

Standard PCA has limitations. Here are specialized variants for different situations.

| Method            | When to use                            | Key difference from PCA                                              |
|-------------------|----------------------------------------|----------------------------------------------------------------------|
| Kernel PCA        | Non-linear data structure              | Uses the kernel trick to find non-linear components                  |
| Sparse PCA        | Need interpretable components          | Forces most loadings to be zero (sparse)                             |
| Incremental PCA   | Huge datasets that don't fit in RAM    | Processes data in mini-batches                                       |
| Randomized PCA    | Very high-dimensional, speed critical  | Approximate SVD — much faster, slight accuracy loss                  |
| Robust PCA        | Data with many outliers                | Separates low-rank + sparse components                               |
| TruncatedSVD / LSA | Sparse text/NLP data                  | Works directly on sparse matrices (no centering)                     |
| LDA               | Classification with class labels       | Maximizes class separability, not variance                           |
| t-SNE             | 2D/3D visualization of clusters        | Non-linear, preserves local structure; not for ML features           |
| UMAP              | Large datasets, visualization, features | Non-linear, preserves global + local structure; faster than t-SNE   |
| Autoencoder       | Complex non-linear compression         | Deep learning version — learns complex representations               |
Quick Decision Guide
Linear + unsupervised + fast? → Standard PCA
Non-linear structure? → Kernel PCA or UMAP
Have class labels? → LDA
Visualization only? → t-SNE or UMAP
Text / NLP? → TruncatedSVD (LSA)
Outliers in data? → Robust PCA
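To make the first two branches of the guide concrete, here is a sketch contrasting plain PCA with Kernel PCA on concentric circles, a classic non-linear structure that a pure rotation cannot separate (parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: the class boundary is a circle, not a line
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Plain PCA is a rotation, so the rings stay nested after projection
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel can unfold the rings into separable groups
X_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X)

print(X_pca.shape, X_kpca.shape)  # (300, 2) (300, 2)
```

Plotting the first component of each transform against the class labels makes the difference visible: the linear projection mixes the rings, while the kernel projection tends to pull them apart.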
Reference · Glossary

Key Terms

Principal Component (PC)
A new axis/feature created by PCA. Each PC is a linear combination of all original features, ordered by variance explained.
Eigenvalue (λ)
A scalar that tells how much variance a principal component captures. Sum of all eigenvalues = total variance in the data.
Eigenvector
A vector that defines the direction of a principal component. Always orthogonal (perpendicular) to other eigenvectors.
Covariance Matrix
An n×n symmetric matrix showing how much each pair of features varies together. PCA computes eigenvectors of this matrix.
Explained Variance Ratio
The fraction of total variance captured by each component: λᵢ / Σλ. Used to decide how many components to keep.
Loadings
The coefficients (weights) that tell how much each original feature contributes to each principal component. Stored in pca.components_.
Scree Plot
A bar/line chart of eigenvalues in descending order. The "elbow" point suggests the optimal number of components k.
Dimensionality Reduction
The process of reducing the number of features while retaining as much information as possible.
Curse of Dimensionality
As features increase, data becomes exponentially sparse, making ML harder. PCA directly addresses this problem.
Multicollinearity
When two or more features are highly correlated. They carry redundant information that PCA compresses into fewer dimensions.
Standardization (Z-score)
Transform features to mean=0, std=1. Required before PCA so large-scale features don't dominate small-scale ones.
SVD (Singular Value Decomposition)
The linear algebra algorithm used to efficiently compute PCA. sklearn's PCA uses SVD internally.
Reconstruction Error
How much information was lost after PCA compression. Measured as MSE between original and reconstructed data.
Biplot
A visualization that shows both PCA scores (data points) and loadings (arrows for each original feature) in the same plot.
Self-Test · Quiz

Test Your Understanding

Each question lists its answer choices; the explanation that follows identifies the correct one.

Q1. What does PCA try to maximize when creating principal components?
Mean of the data
Variance in the projected direction
Correlation between features
Number of data points
PCA finds the direction (eigenvector) along which the data spreads out the most — i.e., maximizing variance. This ensures the projection retains as much information as possible.
Q2. Why must we standardize data BEFORE applying PCA?
To make the data normally distributed
To reduce the number of features
So that features with large scales don't dominate the analysis
To remove outliers from the data
Without standardization, a feature with values 0–100,000 would dominate a feature with values 0–1, simply because of units. Standardization (mean=0, std=1) puts all features on equal footing.
Q3. You apply PCA and get eigenvalues: [3.6, 1.2, 0.4, 0.1, 0.08]. How many components should you keep using the Kaiser rule?
1 component
2 components
3 components
5 components (all)
The Kaiser rule says keep components where eigenvalue ≥ 1.0. Here, only PC1 (3.6) and PC2 (1.2) qualify. PC3 (0.4) is below 1.0, so we stop at k=2.
Q4. After PCA, principal components are always:
Identical to the original features
Highly correlated with each other
Uncorrelated (orthogonal) with each other
Always fewer than 3 dimensions
A key property of PCA is that all principal components are orthogonal (perpendicular) to each other — they are completely uncorrelated. This is by mathematical design (eigenvectors of a symmetric matrix are orthogonal).
Q5. When would you use LDA instead of PCA?
When you have too many features
When you have class labels and want to maximize class separation
When the data has outliers
When the data is non-linear
LDA (Linear Discriminant Analysis) uses class labels to find directions that maximize the distance between classes. PCA ignores labels — it maximizes variance without regard to which class data belongs to. Use LDA for classification preprocessing, PCA for unsupervised exploration.
Q6. A PCA component has these loadings: [+0.9, +0.85, +0.88]. What does this tell you?
The features are independent of each other
All three features contribute positively and roughly equally — this PC captures their shared variation
Only the first feature matters
The data needs more standardization
When all loadings are large and positive (≈ same magnitude), the principal component represents the "average" or "overall level" of all three features together. This typically means the features are all measuring the same underlying construct.
PCA Complete Guide
Data Science Series · MSDE Program