If you read the post about binary classification and decision trees, you may have noticed that we need features and labels for each training sample. They are required for a classification task because the model we train needs to know the right answers - only this way can it learn how feature values affect the result of classification.
But when do we really need to have labeled data and which algorithms can work without it?
The approach I described above is called supervised learning - we supervise the learning process by comparing the responses of the current model (predicted values) with the answers we expect (true values). The algorithm uses these correct answers to evaluate its accuracy on training data and improve itself.
Labels are necessary in the learning process to be able to map X → y, where X are features and y are classes. Can you imagine learning to recognize e.g. fruits given only features such as size, color and texture?
Imagine that you:
- don’t know the labels at all - you only see a couple of fruits with no names (labels),
- don’t get any response (feedback) from a teacher, so even if you pick one of these unlabeled fruits, you have no way to know whether it’s a good choice or not.
I guess it would be difficult for you as well. And the same applies to other tasks and subjects (movies, flowers, prices). I believe that now you understand why labels are a must-have for supervised learning.
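The fruit example can also be sketched in code. Below is a minimal nearest-neighbor "model" - the feature values (size, roundness) and fruit samples are made up for illustration, and this is deliberately not the decision tree from the earlier post - but it shows why supervised learning needs the X → y pairs: without the labels attached to the training samples, there would be nothing to answer with.

```python
import math

# X: features (size in cm, roundness 0-1), y: labels.
# Supervised learning needs both - these values are illustrative assumptions.
training = [
    ((7.0, 0.90), "apple"),
    ((7.5, 0.95), "orange"),
    ((12.0, 0.30), "banana"),
]

def predict(features):
    # Answer with the label of the closest labeled training sample.
    _, label = min(training, key=lambda sample: math.dist(sample[0], features))
    return label

print(predict((11.0, 0.25)))  # closest to the banana sample -> "banana"
```

Remove the labels from `training` and `predict` has nothing left to return - that is the whole point.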
When to use supervised learning?
This method is used for two popular machine learning tasks:
Classification, in which we want to assign a discrete label (class) to each sample, e.g. the name of an animal, a movie category or a positive/negative label. To evaluate model correctness we simply check whether the true and predicted classes are the same (good) or not (bad).

Regression, in which we predict not a discrete label but a continuous value (a number, price, area etc.). So based on the given samples we want to predict e.g. future growth of prices, salaries or tomorrow's temperature. The true value (from labeled data) tells us how far the model's current prediction is from the right answer.
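The difference in how the two tasks are evaluated can be shown in a few lines. The sample values below are made up for illustration - the point is only that classification is scored by exact matches, while regression is scored by distance from the true value:

```python
# Classification: a prediction is simply right or wrong.
true_classes = ["apple", "orange", "apple", "apple"]
pred_classes = ["apple", "apple", "apple", "orange"]

accuracy = sum(t == p for t, p in zip(true_classes, pred_classes)) / len(true_classes)
print(f"classification accuracy: {accuracy:.2f}")  # 2 of 4 correct -> 0.50

# Regression: the error is how far the prediction is from the true value.
true_prices = [100.0, 250.0, 80.0]
pred_prices = [110.0, 240.0, 95.0]

mae = sum(abs(t - p) for t, p in zip(true_prices, pred_prices)) / len(true_prices)
print(f"regression mean absolute error: {mae:.2f}")
```

In both cases the score is only computable because the true labels/values are available - which is exactly why these tasks are supervised.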
These two tasks are used when we have data with both features and labels. If you want to perform regression or classification but you are missing just a few labels, it’s not that bad. You can either:
- find more samples that include proper labels,
- label them manually if you have expert knowledge,
- generate more data by yourself (data augmentation - I’ll describe it in another post).
But there are also machine learning tasks that do not require any labels, classes or ground-truth.
Another type of machine learning is unsupervised learning, where … you don’t need to supervise the model. It works on its own, using unlabeled data. But I already explained why unsupervised learning will work for neither classification nor regression.
Without any prior information you cannot tell whether it’s an apple or an orange. But there’s one thing you can do easily - group fruits together just by looking at their color, shape and other features. This is called clustering and is one of the most common use-cases for unsupervised learning.
When to use unsupervised learning?
This type of machine learning is used for other problems, such as:
Clustering, which means grouping data into several clusters. Separating samples by looking at their features can be done even without labels. The model tries to find similarities between samples and group them together, either by similar feature values or just by the distances between them (see the image below with the division of a set of 2D points into 3 clusters - based on the Euclidean distance between points).
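To make this concrete, here is a minimal sketch of k-means, the classic clustering algorithm based on Euclidean distance. The 2D points and k = 3 mirror the example described above but are made-up values; the initialization (first k points) is a naive choice for reproducibility - real implementations use random or smarter (e.g. k-means++) starts. Note that nothing here ever touches a label:

```python
import math

def kmeans(points, k, iterations=20):
    # Naive deterministic initialization: take the first k points as centroids.
    centroids = [p for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return clusters, centroids

# Three visually separated groups of 2D points - no labels anywhere.
points = [(0, 0), (1, 0), (0, 1),        # group near the origin
          (10, 10), (11, 10), (10, 11),  # group around (10, 10)
          (0, 10), (1, 10), (0, 11)]     # group around (0, 10)

clusters, centroids = kmeans(points, k=3)
for cluster in clusters:
    print(cluster)
```

The algorithm alternates between assigning points to the nearest centroid and recomputing centroids until the grouping stabilizes - similarity (distance) alone is enough to recover the three groups.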
Anomaly detection, in which the model looks for unusual patterns in data. Such an approach may be used for marking outliers (here imagine another set of points with a few of them located far from the rest - the outliers) or detecting anomalies such as credit card fraud.
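A very simple way to mark outliers - again with no labels involved - is to flag values that lie unusually far from the mean. The values below and the threshold of 2 standard deviations are illustrative assumptions (real anomaly-detection methods are more sophisticated), but the idea is the same as in the points-far-from-the-rest picture:

```python
import math

values = [10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 25.0]  # one value far from the rest

mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# Flag anything further than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * std]
print(outliers)  # -> [25.0]
```

Everything the method needs - the notion of "usual" and the distance from it - comes from the data itself, which is what makes this an unsupervised task.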
These tasks are typically not the first ones that you bump into when you start studying ML, but they are just as important as classification or regression problems, if not more so. There are many more algorithms and use-cases for these, so if you are interested in this topic, let me know so I can find some more resources for you!
You can see that machine learning is all about data, but our task or approach determines what kind of data we have to use (labeled or not?). I also mentioned that it is possible to create more synthetic data for supervised learning (data augmentation). This is a commonly used step in solving machine learning problems, but it definitely deserves a separate post.
Machine learning methods like clustering should also be described in detail separately - otherwise this post would be much, much longer. So I think it’s enough for this time, and I really hope this post was clear and helpful. Please leave your feedback and tell me if you enjoyed it!
If you want to read more about (un)supervised learning, or maybe see example usage of regression algorithms or clustering, here are some links that may help you continue reading.