Hello Shouters!! Today we will learn about data cleaning in machine learning.
In the earlier blog, we learned the basics of pandas. In this blog, we continue our journey towards the practical implementation of our first model. If you are a noob in the field of machine learning, I suggest you read a practical approach to machine learning first.
Now that we have learned how to use Google Colab, we have to do some basic data cleaning operations. So the question arises: WHAT IS DATA CLEANING?
Most machine learning algorithms only understand numerical values. But a dataset usually contains many other types of values, such as strings and special characters. In this blog, data cleaning means converting these values to numerical values.
Data cleaning is often the most difficult part of machine learning. In fact, there is a common saying that data scientists spend 80% of their time cleaning data and only 20% on the other operations of machine learning.
This dataset also needs a lot of cleaning, but in the beginning we don’t have to rush through this difficult part. I will create separate blogs on data cleaning, and for now I will explain every line of code I have written to clean the data. If you still come across any data cleaning difficulty, don’t worry, we will cover it in detail in further blogs.
In machine learning, we deal with two types of variables:
- Numeric variable: contains numeric data.
- Categorical variable: contains string data.
As you can see, only the ‘price’ column has numerical data. All other columns contain string values or special characters, so every column except ‘price’ is categorical.
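If you want to check this in code rather than by eye, here is a minimal sketch. It assumes a tiny made-up data frame in place of the real dataset: the rows and values are invented for illustration, and only the column names come from this blog.

```python
import pandas as pd

# Made-up sample rows (assumption); the real dataset has many more rows and columns.
df = pd.DataFrame({
    "Date_of_journey": ["24/03/2019", "1/05/2019"],
    "Arrival_time": ["01:10 22 Mar", "13:15"],
    "total_stops": ["non-stop", "2 stops"],
    "price": [3897, 7662],
})

# Columns whose dtype is numeric
numeric_cols = df.select_dtypes(include="number").columns.tolist()
# Every other column still needs to be converted
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On this sample, only ‘price’ ends up in the numeric list.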
Now we handle these categorical values by converting them to numerical values. We will write some code for that.
First, we convert the ‘Date_of_journey’ column. For this, we make 3 new columns in the data frame ‘df’, namely ‘date’, ‘month’ and ‘year’, by using the ‘split’ function. You can see that 3 new columns were inserted into the data frame ‘df’.
The values produced by the split are still strings, so we convert them to ‘int’. After splitting, the values of ‘Date_of_journey’ live in the 3 separate columns ‘date’, ‘month’ and ‘year’, so we no longer need the ‘Date_of_journey’ column and we drop it.
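The steps above can be sketched roughly like this. I am assuming the dates look like day/month/year with ‘/’ as the separator, and the sample rows are made up, so adjust both to your data.

```python
import pandas as pd

# Made-up sample rows; the "/" separator is an assumption about the date format.
df = pd.DataFrame({
    "Date_of_journey": ["24/03/2019", "1/05/2019"],
    "price": [3897, 7662],
})

# Split each date string into three string parts: day, month, year
parts = df["Date_of_journey"].str.split("/", expand=True)

# Store each part as an integer column
df["date"] = parts[0].astype(int)
df["month"] = parts[1].astype(int)
df["year"] = parts[2].astype(int)

# The original column is no longer needed
df = df.drop(columns=["Date_of_journey"])
print(df)
```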
Now look at the ‘Arrival_time’ column: in some of the values, a date or month is written after the time. To make the column completely numerical, we first remove everything that comes after the time.
You can see the difference in the ‘Arrival_time’ column before and after this step.
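One possible way to do this step is shown below. This is only a sketch: it assumes the time always comes first and is separated from the date by a space, and the sample values are made up.

```python
import pandas as pd

# Made-up sample values; some have a date after the time, some do not.
df = pd.DataFrame({"Arrival_time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"]})

# Keep only the text before the first space, i.e. the time itself
df["Arrival_time"] = df["Arrival_time"].str.split(" ").str[0]

print(df["Arrival_time"].tolist())
```

Values that had no date after the time are left unchanged.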
Now we analyze the ‘total_stops’ column. It has values like 2 stops, 1 stop and non-stop. To convert this column to numerical, we change non-stop to 0 stop and then remove the word ‘stop’ from every entry.
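Here is a small sketch of that idea. It assumes the values are spelled exactly ‘non-stop’, ‘1 stop’, ‘2 stops’ and so on, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample values; the exact spelling "non-stop" is an assumption.
df = pd.DataFrame({"total_stops": ["non-stop", "2 stops", "1 stop"]})

# Rewrite "non-stop" as "0 stop" so every entry starts with a number
df["total_stops"] = df["total_stops"].replace("non-stop", "0 stop")

# Keep the leading number, dropping the word "stop"/"stops", and convert to int
df["total_stops"] = df["total_stops"].str.split(" ").str[0].astype(int)

print(df["total_stops"].tolist())
```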
Now we will use the above strategy for the other columns as well, i.e. the strategy of:
- convert to int
- drop the old column
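To avoid repeating the same lines for every column, the strategy can be wrapped in a small helper. This is just a sketch: the function name `split_to_int_columns` and the new column names ‘Arrival_hour’ and ‘Arrival_minute’ are made up for illustration, and it assumes the time is already in hour:minute form.

```python
import pandas as pd

def split_to_int_columns(df, column, sep, new_names):
    """Split `column` on `sep`, store the parts as int columns, drop the original."""
    parts = df[column].str.split(sep, expand=True)
    out = df.copy()
    for i, name in enumerate(new_names):
        out[name] = parts[i].astype(int)
    return out.drop(columns=[column])

# Made-up sample values, already reduced to "HH:MM"
df = pd.DataFrame({"Arrival_time": ["01:10", "13:15"]})

# Hypothetical new column names
df = split_to_int_columns(df, "Arrival_time", ":", ["Arrival_hour", "Arrival_minute"])
print(df)
```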
Now we see that we have converted most of the columns to numerical values, except the columns route_1, route_2, …
So now we will convert these columns to numerical values with the help of a label encoder, which we will discuss in detail later in these blogs. For now, just remember that it is a method that assigns the same integer to every occurrence of a categorical value, e.g. it might assign 0=BLR, 1=DBL, etc. So instead of these categorical values, our model can use the integers assigned to them.
Similarly, the other remaining categorical values are also converted to numerical values using this label encoder.
The label encoder is part of the scikit-learn library (another Python library, like pandas). We first import it, then create a LabelEncoder object, and then use this object to encode the columns.
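A minimal sketch of label encoding with scikit-learn looks like this. The sample rows are made up, and using one separate encoder per column is my own choice here, not necessarily how the original notebook does it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample values for two of the route columns
df = pd.DataFrame({
    "route_1": ["BLR", "DBL", "BLR"],
    "route_2": ["DBL", "DBL", "BLR"],
})

encoders = {}
for col in ["route_1", "route_2"]:
    encoder = LabelEncoder()                    # one encoder object per column
    df[col] = encoder.fit_transform(df[col])    # learn the labels, then replace them with integers
    encoders[col] = encoder                     # keep it to look up the mapping later

print(df)
print(list(encoders["route_1"].classes_))  # position in this list = assigned integer
```

LabelEncoder numbers the sorted labels starting from 0, so the same string always gets the same integer within a column.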
Now every value in the dataset is numerical, so the data cleaning part is over. We can move on to building a machine learning model, which will be discussed in the next blog.
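If you want to double-check that nothing was missed, a quick sanity check could look like this. It is only a sketch on a made-up, already cleaned data frame.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up example of a cleaned data frame
df = pd.DataFrame({
    "date": [24, 1],
    "total_stops": [0, 2],
    "price": [3897, 7662],
})

# Collect any column that is still not numeric
non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
print(non_numeric)
```

An empty list means every column is ready for the model.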
For placement preparation questions and technical interview preparation, check the Instagram account: https://www.instagram.com/shoutcoders/
If you have any doubts about this blog, feel free to leave a comment. We will respond as soon as possible. Till then…
Frequently Asked Questions –
pandas is the most commonly used library for data cleaning. You can use other libraries, but pandas serves the purpose easily.
It depends on your own judgment. Dropping a column can sometimes decrease performance and sometimes increase it. It comes down to how useful the column is.
No, the method for cleaning will be different for every dataset.