How to Clean Dirty Web-Scraped Data

Harshit Maheshwari
3 min read · Jun 21, 2021

In this article, we will clean the data that we extracted in the previous tutorial. If you haven't read that tutorial yet, it is recommended to read it first.

As an overview, this tutorial cleans product review data extracted from an e-commerce platform. The cleaned data will later be used to understand the sentiment of the reviewers and to answer some important questions from the seller's point of view.

Let's read the crocs_reviews.csv file (created in the last tutorial), which contains the scraped data, into a DataFrame.
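
As a minimal sketch of this step, the snippet below loads review data with `pd.read_csv`. So that it runs anywhere, it reads from a small in-memory stand-in instead of the real file. Only the `Title` column is named in this article; the `Body` column and all sample values are assumptions made for illustration.

```python
import io

import pandas as pd

# Stand-in for crocs_reviews.csv so the sketch runs without the real file.
# "Body" and every value below are hypothetical; only "Title" comes from the article.
sample_csv = io.StringIO(
    "Title,Body\n"
    "Great fit,Very comfortable\n"
    "Too small,Runs a size small\n"
)

# With the real file this would be: df = pd.read_csv("crocs_reviews.csv")
df = pd.read_csv(sample_csv)

# Peek at the first rows to confirm the data loaded as expected.
print(df.head())
```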

pandas makes it very easy to read data from a CSV file into a DataFrame. Once the file is loaded, we can see that the DataFrame contains 7 columns and 8696 rows. Each column needs to go through several cleaning steps to bring it into a format that is better suited for analysis.
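
One way to check the size of a DataFrame and spot missing values is sketched below, using `shape`, `info()`, and `isna().sum()`. The tiny frame here is hypothetical sample data, not the real dataset, so the numbers it prints will not match the 7 columns and 8696 rows of the actual file.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has 7 columns and 8696 rows.
df = pd.DataFrame(
    {
        "Title": ["\nGreat fit\n", None, "\n\nToo small\n"],
        "Body": ["Very comfortable", "Nice colour", "Runs a size small"],
    }
)

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# info() prints each column's dtype and non-null count.
df.info()

# Count the missing values per column explicitly.
null_counts = df.isna().sum()
print(null_counts)
```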

We start by cleaning the first column, Title. Looking at the dataset, we can see multiple '\n' characters at the beginning and the end of every value, so we strip them to get clean string values.
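
A small sketch of this stripping step, assuming the '\n' values are real newline characters at the edges of each string. It uses `Series.str.strip("\n")`, which may or may not be the exact call used in the original code; the sample titles are made up.

```python
import pandas as pd

# Hypothetical titles with newline padding, as described above.
titles = pd.Series(["\n\nGreat fit\n\n", "\nToo small\n", None], name="Title")

# Remove newline characters from both ends of every string.
# Missing values are left as missing and are not turned into strings.
cleaned = titles.str.strip("\n")

print(cleaned.tolist())
```

Calling `str.strip()` with no argument would also remove leading and trailing spaces and tabs, which can be useful if the scraped values carry other whitespace.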

Secondly, we know from the .info() method that there are null values in this column, which we will replace with 'No title'…
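
The replacement of null titles can be sketched with `Series.fillna`. The sample values are hypothetical, and this is one plausible way to do it rather than necessarily the original code.

```python
import pandas as pd

# Hypothetical titles, one of them missing.
titles = pd.Series(["Great fit", None, "Too small"], name="Title")

# Number of missing titles before the fix.
missing_before = titles.isna().sum()

# Replace every missing title with the placeholder text.
filled = titles.fillna("No title")

print(missing_before, filled.tolist())
```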
