The cause of poor performance in machine learning is either overfitting or underfitting the data.

In this post, you will discover the concept of generalization in machine learning and the problems of overfitting and underfitting that go along with it.

Kick-start your project with my new book Master Machine Learning Algorithms, including step-by-step tutorials and the Excel Spreadsheet files for all examples.

Let’s get started.


Overfitting and Underfitting With Machine Learning Algorithms. Photo by Ian Carroll, some rights reserved.


Approximate a Target Function in Machine Learning

Supervised machine learning is best understood as approximating a target function (f) that maps input variables (X) to an output variable (Y).

Y = f(X)

This characterization describes the range of classification and prediction problems and the machine learning algorithms that can be used to address them.
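
As a minimal sketch of this idea (assuming Python with NumPy and scikit-learn, which the post itself does not use), we can draw noisy samples from a simple target function and learn an approximation of it:

# Sketch: approximate an unknown target function f from noisy samples.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(100, 1))                        # input variables (X)
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=1.0, size=100)  # noisy samples of Y = f(X)

model = LinearRegression().fit(X, y)   # learn an approximation of f
print(model.predict([[5.0]]))          # use the approximation to predict Y for a new X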

An important consideration in learning the target function from the training data is how well the model generalizes to new data. Generalization is important because the data we collect is only a sample; it is incomplete and noisy.

Generalization in Machine Learning

In machine learning we describe the learning of the target function from training data as inductive learning.

Induction refers to learning general concepts from specific examples, which is exactly the problem that supervised machine learning aims to solve. This is different from deduction, which works the other way around and seeks to learn specific concepts from general rules.

Generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning.

The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen.

There is terminology used in machine learning when we talk about how well a machine learning model learns and generalizes to new data, namely overfitting and underfitting.

Overfitting and underfitting are the two biggest causes of poor performance in machine learning algorithms.

Statistical Fit

In statistics, a fit refers to how well you approximate a target function.

This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables.

Statistics often describes the goodness of fit, which refers to measures used to estimate how well the approximation of the function matches the target function.

Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of these techniques assume we know the form of the target function we are approximating, which is not the case in machine learning.

If we knew the form of the target function, we would use it directly to make predictions, rather than trying to learn an approximation from samples of noisy training data.
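
For instance, the residual errors mentioned above are just the differences between the observed outputs and the predicted outputs, and the root mean squared error is one common goodness-of-fit measure built on them. A small illustration (the numbers here are invented for the example):

import numpy as np

y_true = np.array([3.1, 4.9, 7.2, 8.8])   # observed outputs
y_pred = np.array([3.0, 5.0, 7.0, 9.0])   # model predictions
residuals = y_true - y_pred                # residual errors
rmse = np.sqrt(np.mean(residuals ** 2))    # root mean squared error
print(residuals, rmse)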

Overfitting in Machine Learning

Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting the training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.
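
As a rough sketch of this (assuming scikit-learn; the ccp_alpha value below is arbitrary), cost-complexity pruning trims detail from a tree after it has been grown:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

unpruned = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
# Cost-complexity pruning removes branches that add little, limiting overfitting.
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=1).fit(X_train, y_train)
print(unpruned.score(X_test, y_test), pruned.score(X_test, y_test))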

Underfitting in Machine Learning

Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model, and this will be obvious as it will have poor performance on the training data.

Underfitting is often not discussed, as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.

A Good Fit in Machine Learning

Ideally, you want to select a model at the sweet spot between underfitting and overfitting.

This is the goal, but it is very difficult to do in practice.

To understand this goal, we can look at the performance of a machine learning algorithm over time as it learns the training data. We can plot both the skill on the training data and the skill on a test dataset we have held back from the training process.

Over time, as the algorithm learns, the error for the model on the training data goes down, and so does the error on the test dataset. If we train for too long, the error on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset. At the same time, the error for the test set starts to rise again as the model's ability to generalize decreases.

The sweet spot is the point just before the error on the test dataset starts to increase, where the model has good skill on both the training dataset and the unseen test dataset.

You can perform this experiment with your favorite machine learning algorithms. This technique is often not useful in practice, however, because by choosing the stopping point for training using the skill on the test dataset, the test set is no longer "unseen" or a standalone objective measure. Some knowledge (a lot of useful knowledge) about that data has leaked into the training procedure.
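
A minimal sketch of this experiment (again assuming scikit-learn, and using tree depth as a stand-in for training time) might look like the following; the test error typically falls and then turns upward as the model grows more complex:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)       # noisy target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for depth in range(1, 12):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    train_err = np.mean((tree.predict(X_train) - y_train) ** 2)
    test_err = np.mean((tree.predict(X_test) - y_test) ** 2)
    print(depth, round(train_err, 3), round(test_err, 3))     # watch test error rise again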

There are two additional techniques you can use to help find the sweet spot in practice: resampling methods and a validation dataset.

How to Limit Overfitting

Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting.

Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data.

There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting:

1. Use a resampling technique to estimate model accuracy.
2. Hold back a validation dataset.

The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k times on different subsets of the training data and build up an estimate of the performance of a machine learning model on unseen data.
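
A minimal sketch of k-fold cross validation with scikit-learn (the dataset and model here are placeholders, and k=10 is just a common choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Train and test the model k=10 times on different subsets of the training data.
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=10)
print(scores.mean(), scores.std())   # estimate of performance on unseen data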

A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset, you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.
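
A minimal sketch of holding back a validation dataset (again assuming scikit-learn; the 20% split is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hold back 20% of the data, untouched until the very end of the project.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print(model.score(X_val, y_val))   # final, objective estimate on held-back data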

Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. If you have the data, using a validation dataset is also an excellent practice.

Further Reading

This section lists some recommended resources if you are looking to learn more about generalization, overfitting and underfitting in machine learning.

Summary

In this post, you discovered that machine learning is solving problems by the method of induction.

You learned that generalization is a description of how well the concepts learned by a model apply to new data. Finally, you learned about the terminology of generalization in machine learning, namely overfitting and underfitting:

Overfitting: Good performance on the training data, poor generalization to other data.

Underfitting: Poor performance on the training data and poor generalization to other data.

Do you have any questions about overfitting, underfitting or this post? Leave a comment and ask your question and I will do my best to answer it.