Contents:
1 Introduction
1.1 What Is Machine Learning
1.2 When Do We Need Machine Learning
1.3 Types of Learning
1.4 Relations to Other Fields
1.5 How to Read This Book
1.5.1 Possible Course Plans Based on This Book
1.6 Notation
Part I Foundations
2 A Gentle Start
2.1 A Formal Model – The Statistical Learning Framework
2.2 Empirical Risk Minimization
2.2.1 Something May Go Wrong
2.3 Empirical Risk Minimization with Inductive Bias
2.3.1 Finite Hypothesis Classes
3 A Formal Learning Model
3.1 PAC Learning
3.2 A More General Learning Model
3.2.1 Releasing the Realizability Assumption
3.2.2 The Scope of Learning Problems Modeled
4 Learning via Uniform Convergence
4.1 Uniform Convergence Is Sufficient for Learnability
4.2 Finite Classes Are Agnostic PAC Learnable
5 The Bias-Complexity Tradeoff
5.1 The No-Free-Lunch Theorem
5.1.1 No-Free-Lunch and Prior Knowledge
5.2 Error Decomposition
6 The VC-Dimension
6.1 Infinite-Size Classes Can Be Learnable
6.2 The VC-Dimension
6.3 Threshold Functions
6.3.1 Intervals
6.3.2 Axis Aligned Rectangles
6.3.3 Finite Classes
6.3.4 VC-Dimension and the Number of Parameters
6.4 The Fundamental Theorem of PAC Learning
6.5 Proof of Theorem
6.5.1 Sauer's Lemma and the Growth Function
6.5.2 Uniform Convergence for Classes of Small Effective Size
7 Nonuniform Learnability
7.1 Nonuniform Learnability
7.1.1 Characterizing Nonuniform Learnability
7.2 Structural Risk Minimization
7.3 Minimum Description Length and Occam's Razor
7.3.1 Occam's Razor
7.4 Other Notions of Learnability – Consistency
7.5 Discussing the Different Notions of Learnability
7.5.1 The No-Free-Lunch Theorem Revisited
8 The Runtime of Learning
8.1 Formal Definition*
8.2 Implementing the ERM Rule
8.2.1 Finite Classes
8.2.2 Axis Aligned Rectangles
8.2.3 Boolean Conjunctions
8.2.4 Learning 3-Term DNF
8.3 Efficiently Learnable, but Not by a Proper ERM
8.4 Hardness of Learning*
Part II From Theory to Algorithms
9 Linear Predictors
9.1 Halfspaces
9.1.1 Linear Programming for the Class of Halfspaces
9.1.2 Perceptron for Halfspaces
9.1.3 The VC Dimension of Halfspaces
9.2 Linear Regression
9.2.1 Least Squares
9.2.2 Linear Regression for Polynomial Regression Tasks
9.3 Logistic Regression
10 Boosting
10.1 Weak Learnability
10.1.1 Efficient Implementation of ERM for Decision Stumps
10.2 AdaBoost
10.3 Linear Combinations of Base Hypotheses
10.3.1 The VC-Dimension of L(B, T)
10.4 AdaBoost for Face Recognition
11 Model Selection and Validation
11.1 Model Selection Using SRM
11.2 Validation
11.2.1 Hold Out Set
11.2.2 Validation for Model Selection
11.2.3 The Model-Selection Curve
11.2.4 k-Fold Cross Validation
11.2.5 Train-Validation-Test Split
11.3 What to Do If Learning Fails
12 Convex Learning Problems
12.1 Convexity, Lipschitzness, and Smoothness
12.1.1 Convexity
12.1.2 Lipschitzness
12.1.3 Smoothness
12.2 Convex Learning Problems
12.2.1 Learnability of Convex Learning Problems
12.2.2 Convex-Lipschitz/Smooth-Bounded Learning Problems
12.3 Surrogate Loss Functions
13 Regularization and Stability
13.1 Regularized Loss Minimization
13.1.1 Ridge Regression
13.2 Stable Rules Do Not Overfit
13.3 Tikhonov Regularization as a Stabilizer
13.3.1 Lipschitz Loss
13.3.2 Smooth and Nonnegative Loss
13.4 Controlling the Fitting-Stability Tradeoff
14 Stochastic Gradient Descent
14.1 Gradient Descent
14.1.1 Analysis of GD for Convex-Lipschitz Functions
14.2 Subgradients
14.2.1 Calculating Subgradients
14.2.2 Subgradients of Lipschitz Functions
14.2.3 Subgradient Descent
14.3 Stochastic Gradient Descent (SGD)
14.3.1 Analysis of SGD for Convex-Lipschitz-Bounded Functions
14.4 Variants
14.4.1 Adding a Projection Step
14.4.2 Variable Step Size
14.4.3 Other Averaging Techniques
14.4.4 Strongly Convex Functions*
14.5 Learning with SGD
14.5.1 SGD for Risk Minimization
14.5.2 Analyzing SGD for Convex-Smooth Learning Problems
14.5.3 SGD for Regularized Loss Minimization
15 Support Vector Machines
15.1 Margin and Hard-SVM
15.1.1 The Homogenous Case
15.1.2 The Sample Complexity of Hard-SVM
15.2 Soft-SVM and Norm Regularization
15.2.1 The Sample Complexity of Soft-SVM
15.2.2 Margin and Norm-Based Bounds versus Dimension
15.2.3 The Ramp Loss*
15.3 Optimality Conditions and "Support Vectors"*
15.4 Duality*
15.5 Implementing Soft-SVM Using SGD
16 Kernel Methods
16.1 Embeddings into Feature Spaces
16.2 The Kernel Trick
16.2.1 Kernels as a Way to Express Prior Knowledge
16.2.2 Characterizing Kernel Functions*
16.3 Implementing Soft-SVM with Kernels
17 Multiclass, Ranking, and Complex Prediction Problems
17.1 One-versus-All and All-Pairs
17.2 Linear Multiclass Predictors
17.2.1 How to Construct Ψ
17.2.2 Cost-Sensitive Classification
17.2.3 ERM
17.2.4 Generalized Hinge Loss
17.2.5 Multiclass SVM and SGD
17.3 Structured Output Prediction
17.4 Ranking
17.4.1 Linear Predictors for Ranking
17.5 Bipartite Ranking and Multivariate Performance Measures
17.5.1 Linear Predictors for Bipartite Ranking
18 Decision Trees
18.1 Sample Complexity
18.2 Decision Tree Algorithms
18.2.1 Implementations of the Gain Measure
18.2.2 Pruning
18.2.3 Threshold-Based Splitting Rules for Real-Valued Features
18.3 Random Forests
19 Nearest Neighbor
19.1 k Nearest Neighbors
19.2 Analysis
19.2.1 A Generalization Bound for the 1-NN Rule
19.2.2 The "Curse of Dimensionality"
19.3 Efficient Implementation*
20 Neural Networks
20.1 Feedforward Neural Networks
20.2 Learning Neural Networks
20.3 The Expressive Power of Neural Networks
20.3.1 Geometric Intuition
20.4 The Sample Complexity of Neural Networks
20.5 The Runtime of Learning Neural Networks
20.6 SGD and Backpropagation
Part III Additional Learning Models
21 Online Learning
21.1 Online Classification in the Realizable Case
21.1.1 Online Learnability
21.2 Online Classification in the Unrealizable Case
21.2.1 Weighted-Majority
21.3 Online Convex Optimization
21.4 The Online Perceptron Algorithm
22 Clustering
22.1 Linkage-Based Clustering Algorithms
22.2 k-Means and Other Cost Minimization Clusterings
22.2.1 The k-Means Algorithm
22.3 Spectral Clustering
22.3.1 Graph Cut
22.3.2 Graph Laplacian and Relaxed Graph Cuts
22.3.3 Unnormalized Spectral Clustering
22.4 Information Bottleneck*
22.5 A High Level View of Clustering
23 Dimensionality Reduction
23.1 Principal Component Analysis (PCA)
23.1.1 A More Efficient Solution for the Case d ≫ m
23.1.2 Implementation and Demonstration
23.2 Random Projections
23.3 Compressed Sensing
23.3.1 Proofs*
23.4 PCA or Compressed Sensing?
24 Generative Models
24.1 Maximum Likelihood Estimator
24.1.1 Maximum Likelihood Estimation for Continuous Random Variables
24.1.2 Maximum Likelihood and Empirical Risk Minimization
24.1.3 Generalization Analysis
24.2 Naive Bayes
24.3 Linear Discriminant Analysis
24.4 Latent Variables and the EM Algorithm
24.4.1 EM as an Alternate Maximization Algorithm
24.4.2 EM for Mixture of Gaussians (Soft k-Means)
24.5 Bayesian Reasoning
25 Feature Selection and Generation
25.1 Feature Selection
25.1.1 Filters
25.1.2 Greedy Selection Approaches
25.1.3 Sparsity-Inducing Norms
25.2 Feature Manipulation and Normalization
25.2.1 Examples of Feature Transformations
25.3 Feature Learning
25.3.1 Dictionary Learning Using Auto-Encoders
Part IV Advanced Theory
26 Rademacher Complexities
26.1 The Rademacher Complexity
26.1.1 Rademacher Calculus
26.2 Rademacher Complexity of Linear Classes
26.3 Generalization Bounds for SVM
26.4 Generalization Bounds for Predictors with Low ℓ1 Norm
27 Covering Numbers
27.1 Covering
27.1.1 Properties
27.2 From Covering to Rademacher Complexity via Chaining
28 Proof of the Fundamental Theorem of Learning Theory
28.1 The Upper Bound for the Agnostic Case
28.2 The Lower Bound for the Agnostic Case
28.2.1 Showing That m(ε, δ) ≥ 0.5 log(1/(4δ))/ε²
28.2.2 Showing That m(ε, 1/8) ≥ 8d/ε²
28.3 The Upper Bound for the Realizable Case
28.3.1 From ε-Nets to PAC Learnability
29 Multiclass Learnability
29.1 The Natarajan Dimension
29.2 The Multiclass Fundamental Theorem
29.2.1 On the Proof of Theorem
29.3 Calculating the Natarajan Dimension
29.3.1 One-versus-All Based Classes
29.3.2 General Multiclass-to-Binary Reductions
29.3.3 Linear Multiclass Predictors
29.4 On Good and Bad ERMs
30 Compression Bounds
30.1 Compression Bounds
30.2 Axis Aligned Rectangles
30.2.1 Halfspaces
30.2.2 Separating Polynomials
30.2.3 Separation with Margin
31 PAC-Bayes
31.1 PAC-Bayes Bounds
thank you for visiting :)