Dimensionality Reduction

Overview

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration by obtaining a smaller set of principal variables that still capture most of the information in the data. It is a crucial technique in machine learning and data analysis, used to mitigate the "curse of dimensionality", reduce computational complexity, remove redundant features, and enable data visualization.

There are two main approaches to dimensionality reduction:

  • Feature Selection: Selects a subset of the original features without transforming them. Examples include filter methods (e.g., chi-squared, ANOVA), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., LASSO regression); a short sketch follows this list.
  • Feature Extraction (or Feature Projection): Creates new features by combining the original features. These new features are typically linear or non-linear combinations of the original ones. Examples include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE).
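
A minimal feature selection sketch, assuming the Iris dataset and scikit-learn's SelectKBest with an ANOVA F-test (both are illustrative choices, not prescribed by this section):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    # Load a small labeled dataset (illustrative choice)
    X, y = load_iris(return_X_y=True)

    # Filter method: keep the 2 features with the highest ANOVA F-scores
    selector = SelectKBest(score_func=f_classif, k=2)
    X_selected = selector.fit_transform(X, y)

    # Unlike feature extraction, the retained columns are original features
    print("Selected feature indices:", selector.get_support(indices=True))
    print("Shape before:", X.shape, "after:", X_selected.shape)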

Benefits of dimensionality reduction include:

  • Reduced storage space and computational time.
  • Improved model performance by removing noise and redundancy.
  • Better data visualization when reducing to 2D or 3D.

Core Concepts

Implementation

  • Principal Component Analysis (PCA) with scikit-learn

    
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    import pandas as pd
    
    # Load the Iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target
    feature_names = iris.feature_names
    
    # Standardize the features (important for PCA)
    # PCA is sensitive to feature scale, so standardize the features before applying it.
    X_scaled = StandardScaler().fit_transform(X)
    
    # Apply PCA
    # n_components can be an integer (number of components) or a float (variance to retain)
    # Here, we reduce from 4 dimensions to 2 dimensions
    pca = PCA(n_components=2)
    principal_components = pca.fit_transform(X_scaled)
    
    # Create a DataFrame with the principal components
    pca_df = pd.DataFrame(data=principal_components, columns=['Principal Component 1', 'Principal Component 2'])
    final_df = pd.concat([pca_df, pd.DataFrame(y, columns=['target'])], axis=1)
    
    # Explained variance ratio
    # This tells us how much variance is captured by each principal component
    print("Explained variance ratio by component:", pca.explained_variance_ratio_)
    print(f"Total explained variance by 2 components: {sum(pca.explained_variance_ratio_)*100:.2f}%")
    
    # Visualize the 2D PCA results
    plt.figure(figsize=(10, 7))
    targets = iris.target_names
    colors = ['r', 'g', 'b']
    
    for target_val, color in zip(range(len(targets)), colors):
        indices_to_keep = final_df['target'] == target_val
        plt.scatter(final_df.loc[indices_to_keep, 'Principal Component 1'],
                    final_df.loc[indices_to_keep, 'Principal Component 2'],
                    c=color,
                    s=50,
                    label=targets[target_val])
    
    plt.title('PCA of Iris Dataset (2 Components)')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend()
    plt.grid(True)
    # plt.show()
    
    print("
    Shape of original data:", X_scaled.shape)
    print("Shape of data after PCA:", principal_components.shape)
    
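As the comments above note, n_components may also be given as a float between 0 and 1, in which case PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal, self-contained sketch of this variant (the 0.95 threshold is an illustrative choice):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X_scaled = StandardScaler().fit_transform(load_iris().data)

    # Keep as many components as needed to retain ~95% of the variance
    pca_95 = PCA(n_components=0.95)
    X_reduced = pca_95.fit_transform(X_scaled)

    print("Components kept:", pca_95.n_components_)
    print("Cumulative explained variance:", pca_95.explained_variance_ratio_.sum())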

Interview Examples

PCA vs. LDA

What are the main differences between Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA)? When would you prefer one over the other?
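
In brief, PCA is unsupervised: it ignores class labels and finds orthogonal directions of maximum variance. LDA is supervised: it uses class labels to find directions that maximize between-class separation relative to within-class scatter, and it can return at most n_classes - 1 components. PCA is generally preferred for compression, denoising, or when labels are unavailable; LDA is preferred as a preprocessing step for classification when labels exist. A minimal sketch contrasting the two on the Iris data, assuming scikit-learn's LinearDiscriminantAnalysis (the dataset choice is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)

    # PCA: unsupervised, fits on X only, maximizes retained variance
    X_pca = PCA(n_components=2).fit_transform(X_scaled)

    # LDA: supervised, fits on (X, y), maximizes class separability
    # (at most n_classes - 1 = 2 components for the 3 Iris classes)
    X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y)

    print("PCA output shape:", X_pca.shape)
    print("LDA output shape:", X_lda.shape)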

Practice Questions

1. What are the practical applications of Dimensionality Reduction? (Medium)

Hint: Consider both academic and industry use cases

2. Explain the core concepts of Dimensionality Reduction. (Easy)

Hint: Think about the fundamental principles

3. How would you implement this in a production environment? (Hard)

Hint: Consider scalability and efficiency