The most common family portrait of machine learning has three members – 1) Supervised Learning 2) Unsupervised Learning 3) Reinforcement Learning. But what if I told you that a distant and not-so-popular cousin is missing here? Its name is Semi Supervised Learning.
Though not widely known to beginners, semi supervised learning has a very important role to play in machine learning, as it deals with one of the most common and practical problems in data collection – obtaining labeled data.
In this post we will understand why semi supervised learning is needed and give a gentle introduction to it.
Before we jump into the main topic, let us go through some key concepts and build a step-by-step understanding of why we need semi supervised learning.
Data is labeled when it has a description or annotation associated with it. A label helps determine what the data is about, or which class it belongs to. Captions of animal images and the loan status of past applications are typical examples of labels.
When data is not labeled, it is known as unlabeled data. Without a label we cannot directly tell the machine what the data is about or which class it belongs to. So animal images without captions, or loan application data without the loan status, are examples of unlabeled data.
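To make this concrete, here is a minimal sketch of how a mixed data set is often represented in code. The feature values are made up for illustration; the `-1` convention for missing labels is the one used by scikit-learn's semi-supervised tools.

```python
import numpy as np

# Hypothetical loan-application features: [income_in_thousands, credit_score]
X = np.array([
    [45, 680],   # labeled: approved
    [30, 510],   # labeled: rejected
    [60, 720],   # unlabeled
    [25, 590],   # unlabeled
])

# Convention: -1 marks a missing label (1 = approved, 0 = rejected)
y = np.array([1, 0, -1, -1])

labeled_mask = y != -1
print("Labeled rows:", labeled_mask.sum())       # 2
print("Unlabeled rows:", (~labeled_mask).sum())  # 2
```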
Supervised Learning

Supervised learning is the branch of machine learning that trains on historical data that is labeled, and creates a model that can predict those same labels for new, unseen data.
In our examples, based on its earlier training on already labeled data, a supervised learning model can –
- Determine the caption of new images
- Decide whether to approve or reject the new loan application
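As an illustration, here is a minimal supervised learning sketch for the loan example. The feature values and the choice of logistic regression are assumptions for illustration, not a prescription.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical loan data: [income_in_thousands, credit_score]
X_train = np.array([[45, 680], [30, 510], [60, 720],
                    [25, 590], [55, 700], [28, 530]])
y_train = np.array([1, 0, 1, 0, 1, 0])  # 1 = approved, 0 = rejected

# Train on labeled history, then classify a new, unseen application
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
new_application = np.array([[50, 690]])
print(model.predict(new_application))
```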
Unsupervised Learning

In this type of learning the data is not labeled, and there is no training phase with known outputs. The algorithm creates a model that tries to uncover hidden patterns within the unlabeled data.
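A minimal clustering sketch shows the idea – no labels are given, yet the algorithm recovers the grouping on its own. K-means and the toy 2-D points are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points that visibly form two groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Ask for two clusters; no labels are involved at any point
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)
```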
For a more in-depth comparison between supervised and unsupervised learning, check the post below –
Labeled Data is Expensive!
As you can see, the first and foremost prerequisite of supervised learning is that we have some data which is already labeled to train on. But acquiring labeled data is not always easy. Let us see how we could get labeled data for our two ongoing examples.
Labeling Animal Images – This needs the resource-intensive manual effort of labeling each image individually. Good training generally requires plenty of images (in the thousands), so imagine how expensive it would be to label the entire image data set.
Labeling Loan Applications – The labels (approved/rejected) of past loan applications will most likely be available in the banking system itself and can be easily extracted with minimal manual effort.
So as you can see, some data is very easy to get labeled, but for other data labeling needs too much manual effort. And when the data is very specific to a domain, like medical records, it needs a subject matter expert as well.
Hence getting labeled data for supervised learning is expensive – a luxury, even!
Semi Supervised Learning
So far we have seen that when we have labeled data we can use supervised learning, and for unlabeled data unsupervised learning seems to be the choice.
But what if we have a data set which is a mix of labeled and unlabeled data, and we still want to create a classification model that learns a mapping between input data and output labels? This is where semi supervised learning comes into the picture.
Semi supervised learning algorithms can work with both labeled and unlabeled data, and are hence considered to lie somewhere between supervised and unsupervised learning.
Semi Supervised Learning Algorithms
Some of the most well-known algorithms in semi supervised learning are –
- Self Training
- Generative Models
- Graph Based Algorithms
- Semi Supervised Support Vector Machines (S3VM)
- Multi-view Algorithm
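As a quick sketch of the first of these, self training wraps an ordinary supervised classifier: it trains on the labeled points, pseudo-labels the unlabeled points it is confident about, and retrains until no confident points remain. scikit-learn ships such a wrapper; the synthetic data and the 0.8 confidence threshold below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic dataset where only about 10% of points keep their labels
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
rng = np.random.RandomState(42)
y_partial = y.copy()
y_partial[rng.rand(len(y)) > 0.1] = -1  # -1 marks unlabeled samples

# Self training: fit on labeled points, pseudo-label confident unlabeled
# points (predicted probability above the threshold), and refit
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.8)
model.fit(X, y_partial)
print(model.predict(X[:5]))
```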
Benefits of Semi Supervised Learning
We could have just ignored the unlabeled data and used whatever labeled data we had to create a supervised learning model. That sounds okay, but if the labeled data is very limited it would result in a poor model.
Semi supervised learning algorithms help utilize even the unlabeled data, resulting in a more robust trained model with a better decision boundary.
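A rough sketch of this effect: train a purely supervised baseline on only a handful of labels, then let label propagation spread those labels through the graph of nearby points, exploiting the unlabeled data as well. The two-moons data set is an assumption chosen to suit this technique; gains like this are typical on such data but are not guaranteed in general.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation

# 300 points, but only 10 (5 per class) keep their labels
X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
labeled = np.concatenate([np.where(y == 0)[0][:5], np.where(y == 1)[0][:5]])
y_partial = np.full_like(y, -1)
y_partial[labeled] = y[labeled]

# Supervised baseline: sees only the 10 labeled points
supervised = LogisticRegression().fit(X[labeled], y[labeled])

# Label propagation: also exploits the 290 unlabeled points
semi = LabelPropagation().fit(X, y_partial)

print("supervised accuracy:     ", supervised.score(X, y))
print("semi-supervised accuracy:", (semi.predict(X) == y).mean())
```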
Assumptions for Semi Supervised Learning
The data involved should satisfy some underlying assumptions as a prerequisite for semi supervised learning –
- Smoothness Assumption – Data points that are close to each other are more likely to have the same output label.
- Cluster Assumption – The data tends to form clusters, and data points in the same cluster are likely to have the same output label.
- Manifold Assumption – Higher dimensional data lies on a lower dimensional manifold. Unlike the previous two, this is less intuitive, but it helps avoid situations involving the curse of dimensionality.
Applications of Semi Supervised Learning
Semi supervised learning is most widely used in fields where labeled data is very hard to acquire, or requires a lot of manual effort or expertise to label. These are some of the practical applications –
- Image Recognition – Images have to be labeled manually, and if the domain requires expertise it becomes even more cumbersome and expensive.
- Protein Sequence Classification – Analysing and ultimately labeling a protein's function takes a good amount of research work.
- Speech Recognition – Speech annotation is even more painful: it needs a human to listen to hours of recordings and write transcripts manually. Recently, Alexa scientists were able to reduce speech recognition error by up to 22% by using semi supervised learning. Read more about it here.
- Web Page Classification – With billions of web pages on the internet, labeled pages are inevitably very few compared to unlabeled ones, and this is where semi supervised learning can be really useful.
A word of caution, though – this might sound contrary to what we have discussed above, but using semi supervised learning when both labeled and unlabeled data are present does not guarantee a model with better performance.
All of this ultimately depends on the data, however. There is a good article about this here.
In The End…
The idea of using unlabeled data for classification has been around in the literature since as far back as the 1960s, though the specific term became popular from the 1990s onward. Semi supervised learning is still an emerging field of research, and that might be one of the reasons it is yet to get the recognition its cousins in the machine learning family enjoy. I hope this post served as a good introduction to the concept.
Do share your feedback about this post in the comments section below. If you found this post informative, please share it and subscribe by clicking on the bell icon for quick notifications of upcoming posts. And yes, don't forget to join our new community MLK Hub and Make AI Simple together.