Exploratory Data Analysis of Web-Scraped Data

Harshit Maheshwari
4 min read · Nov 20, 2023

Photo by Luke Chesser on Unsplash

In this article, we will analyse the data that we extracted and cleaned in the previous tutorials. If you haven’t read them yet, it is recommended to read them first:
Tutorial 1: Web-scrapping Product Reviews in 3 Minutes.
Tutorial 2: How to clean a dirty Web Scrapped Data?

As an overview, in this tutorial we will analyse product review data extracted from an e-commerce platform. The cleaned data will be used to understand the sentiments of the reviewers and to answer some important questions from the seller’s point of view.

Let’s start by importing the libraries required for our data analysis and understanding what each one is used for:

  1. Pandas: One of the most important Python libraries for data analysis tasks. We will use it to read the data into a dataframe and perform calculations on it.
  2. Matplotlib: This library is mainly used for plotting graphs.
  3. Seaborn: This library is built on top of Matplotlib and provides more polished statistical plots.
  4. NLTK: The Natural Language Toolkit, an important library in the language processing world. We will use it to work with the review texts.
  5. Wordcloud: A library used to create a word cloud from a large chunk of text.
  6. re: Python’s built-in module for working with regular expressions.
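The usual imports for these libraries are `import pandas as pd`, `import matplotlib.pyplot as plt`, `import seaborn as sns`, `import nltk`, `from wordcloud import WordCloud` and `import re`. Before running them, it can help to check that everything is installed. The sketch below is only one possible way to do that; the helper name `missing_libraries` is made up for this example.

```python
import importlib.util

# Import names of the libraries listed above; "re" ships with Python itself.
REQUIRED_LIBRARIES = ["pandas", "matplotlib", "seaborn", "nltk", "wordcloud", "re"]


def missing_libraries(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_libraries(REQUIRED_LIBRARIES)
if missing:
    print("Install these before continuing:", ", ".join(missing))
else:
    print("All libraries are available.")
```

Anything reported as missing can typically be installed with `pip install` followed by the package name.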

Let’s read the crocs_reviews.csv file (created in the last tutorial), which contains the scraped data, into a dataframe.
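A minimal sketch of this step is shown below. Since the real file is not available here, it reads a few made-up rows from an in-memory string; the column names `rating`, `size` and `review` are assumptions and may differ from the actual columns of crocs_reviews.csv.

```python
import io

import pandas as pd

# Made-up stand-in rows so the sketch runs without the real file.
# The column names are assumptions, not the actual columns of crocs_reviews.csv.
sample_csv = io.StringIO(
    "rating,size,review\n"
    "5,8,Very comfortable\n"
    "4,9,Good fit\n"
    "5,8,Love them\n"
)

# With the real data this would be: df = pd.read_csv("crocs_reviews.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows
```

`df.shape` is the quickest way to confirm how many rows and columns were loaded.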

The dataframe’s shape is 8696 × 9, that is, it contains 8696 rows and 9 columns. Next we plot the data in the size column to get a better understanding of the data.
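A plot of the size column could be produced roughly as follows. This is a sketch on made-up values, and the column name `size` is an assumption; with the real data, `df` would be the dataframe read from crocs_reviews.csv.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sizes standing in for the real "size" column (assumed column name).
df = pd.DataFrame({"size": [8, 9, 8, 7, 8, 9, 10]})

# Count how many reviews each size received, ordered by size.
size_counts = df["size"].value_counts().sort_index()

fig, ax = plt.subplots(figsize=(8, 4))
size_counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Size")
ax.set_ylabel("Number of reviews")
ax.set_title("Review count per size")
fig.tight_layout()
# In a script, call plt.show() or fig.savefig("size_counts.png") to see the result.
```

Each review is treated here as one row, so the bar heights are simply row counts per size.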

The above graph shows the number of reviews for each Crocs size in the men’s category, highlighting the sizes that received the most reviews. From this graph we can also infer that size 8 could be the highest-selling size in this category.

The output graph of the above code tells us that from 2018 to 2020 there was a steep rise in the number of ratings given, and that the number of 5-star ratings peaked by the end of 2020.
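A trend like the one described above can be plotted by counting ratings per year and per star value. The sketch below uses made-up rows, and the column names `date` and `rating` are assumptions about the dataset, not confirmed names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; "date" and "rating" are assumed column names.
df = pd.DataFrame({
    "date": ["2018-03-01", "2019-06-15", "2019-11-02",
             "2020-01-20", "2020-07-04", "2020-12-25"],
    "rating": [4, 5, 3, 5, 5, 4],
})
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year

# Rows: year, columns: star rating, values: number of ratings.
ratings_per_year = pd.crosstab(df["year"], df["rating"])
print(ratings_per_year)

# One line per star rating, showing how its count changes over the years.
ax = ratings_per_year.plot(kind="line", marker="o", figsize=(8, 4))
ax.set_xlabel("Year")
ax.set_ylabel("Number of ratings")
```

`pd.crosstab` is used here for brevity; a `groupby` on year and rating followed by `size()` would give the same counts.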

Let us understand the analysis process by answering some questions.

Question 1: What are the top 5 most important words used in the product reviews?

This question can be answered easily by using text mining techniques and a word cloud. First, we will join all the reviews into a single paragraph. Then we will remove any unwanted symbols from it.
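These two steps can be sketched with plain Python and the re module. The reviews below are made up; with the real data the list would come from the review column of the dataframe, whose name is not shown here. Keeping only letters is one possible definition of "unwanted symbols", chosen for this example.

```python
import re

# Made-up reviews standing in for the review column of the dataframe.
reviews = [
    "Very comfortable!!! Love them :)",
    "Good fit, but a bit pricey...",
    "Comfortable & light-weight #crocs",
]

# Join all reviews into one paragraph.
all_reviews = " ".join(reviews)

# Replace everything except letters and whitespace with a space,
# then collapse repeated whitespace and lowercase the text.
cleaned = re.sub(r"[^A-Za-z\s]", " ", all_reviews)
cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()

print(cleaned)
```

Lowercasing at this point means that "Comfortable" and "comfortable" are later counted as the same word.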

Afterwards, we will split the complete string into individual words, so that we can remove the stopwords from them. At this stage we also add the product name to the list of stop_words, since it appears in many reviews and would otherwise crowd out more informative words.
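The splitting and filtering step could look like the sketch below. To keep it runnable without downloads, it uses a tiny hand-written stop word set; with NLTK, the English list is usually loaded via `nltk.corpus.stopwords.words("english")` after running `nltk.download("stopwords")`. The sample text and the exact product-name entries are assumptions.

```python
# Made-up cleaned text standing in for the joined reviews.
text = "these crocs are very comfortable and the crocs fit is good"

# Tiny stand-in for NLTK's English stop word list (assumption for this sketch).
stop_words = {"these", "are", "very", "and", "the", "is"}

# Add the product name so it does not dominate the result.
stop_words.update({"crocs", "croc"})

# Split into words and keep only those that are not stop words.
words = text.split()
filtered_words = [word for word in words if word not in stop_words]

print(filtered_words)
```

Using a set for `stop_words` keeps each membership check fast, which matters when the paragraph contains thousands of reviews.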

After removing the stopwords, we will join the words again and form a paragraph.
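As a sketch of this last step, the filtered words can be joined back into a paragraph and counted to find the most frequent ones. The word list below is made up, and counting with `collections.Counter` is shown here as a simple alternative to reading the top words off a word cloud.

```python
from collections import Counter

# Made-up words standing in for the result of the stop word removal.
filtered_words = [
    "comfortable", "fit", "comfortable", "good", "light",
    "fit", "comfortable", "good", "fit", "comfortable",
    "soft", "light", "good", "fit", "comfortable",
]

# Join the words back into a single paragraph (the input for a word cloud).
paragraph = " ".join(filtered_words)

# Count word frequencies and take the five most common words.
top_five = Counter(filtered_words).most_common(5)

for word, count in top_five:
    print(word, count)
```

The same `paragraph` string can be passed to the wordcloud library, for example with `WordCloud().generate(paragraph)`, to draw the word cloud.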

Summary and Conclusion:

From the above analysis, we have learnt several things:

  1. Analysis of products is very important to increase sales and profit margins. Any retail business needs proper analysis.
  2. At the same time, analysis of customer reviews helps to understand the customer satisfaction level, what they like and do not like about the product. Their reviews answer the questions that the ratings or product sales cannot answer.
  3. It is very important to understand the business and to gain the domain knowledge to bring out the solutions.
  4. We have learned that it takes more time to fetch and clean the data as compared to analysing it. 😃
  5. Correctness of data is really important to get the proper solutions. Improper data can lead to wrong decisions and hence serious problems.
  6. Visual representations can solve more problems than just fiddling around with numbers.

The most enjoyable aspect of this analysis work was deciphering the reviews. It would be fantastic to delve deeper into the Natural Language Processing side of the analysis in the future. In subsequent posts, I’ll go over specific ideas like vectorization, TF-IDF, and recommender systems. In addition, I will examine data obtained from Walmart and other significant e-commerce companies. I will not only analyze e-commerce sales, but will also construct recommender systems similar to those of Netflix and YouTube.

Happy Learning! :)

