Hey everyone, This is an EDA project analyzing super store data set and visualizing it. The objective of this project is to analyze and identify trends and patterns in the current retail sales and identify which sector of the market is under loss and which sector is making huge profits. Every sector offers discounts on sales, but, do they collect profits as needed on the discounts they offer? Which shipment class boosts the sales of which category?

This tutorial will guide you through the process of retrieving answers to all these questions.

Let us get started!

Loading Packages

rm(list=ls())
library(ggplot2)
library(tidyverse)

Reading dataset

df <- read.csv("SampleSuperstore.csv")

str(df)

## 'data.frame':    9994 obs. of  13 variables:
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Country     : Factor w/ 1 level "United States": 1 1 1 1 1 1 1 1 1 1 ...
##  $ City        : Factor w/ 531 levels "Aberdeen","Abilene",..: 195 195 267 154 154 267 267 267 267 267 ...
##  $ State       : Factor w/ 49 levels "Alabama","Arizona",..: 16 16 4 9 9 4 4 4 4 4 ...
##  $ Postal.Code : int  42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
##  $ Region      : Factor w/ 4 levels "Central","East",..: 3 3 4 3 3 4 4 4 4 4 ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

summary(df)

##           Ship.Mode           Segment              Country    
##  First Class   :1538   Consumer   :5191   United States:9994  
##  Same Day      : 543   Corporate  :3020                       
##  Second Class  :1945   Home Office:1783                       
##  Standard Class:5968                                          
##                                                               
##                                                               
##                                                               
##             City               State       Postal.Code        Region    
##  New York City: 915   California  :2001   Min.   : 1040   Central:2323  
##  Los Angeles  : 747   New York    :1128   1st Qu.:23223   East   :2848  
##  Philadelphia : 537   Texas       : 985   Median :56431   South  :1620  
##  San Francisco: 510   Pennsylvania: 587   Mean   :55190   West   :3203  
##  Seattle      : 428   Washington  : 506   3rd Qu.:90008                 
##  Houston      : 377   Illinois    : 492   Max.   :99301                 
##  (Other)      :6480   (Other)     :4295                                 
##             Category         Sub.Category      Sales              Quantity    
##  Furniture      :2121   Binders    :1523   Min.   :    0.444   Min.   : 1.00  
##  Office Supplies:6026   Paper      :1370   1st Qu.:   17.280   1st Qu.: 2.00  
##  Technology     :1847   Furnishings: 957   Median :   54.490   Median : 3.00  
##                         Phones     : 889   Mean   :  229.858   Mean   : 3.79  
##                         Storage    : 846   3rd Qu.:  209.940   3rd Qu.: 5.00  
##                         Art        : 796   Max.   :22638.480   Max.   :14.00  
##                         (Other)    :3613                                      
##     Discount          Profit         
##  Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :0.2000   Median :    8.666  
##  Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :0.8000   Max.   : 8399.976  
##

Data preparation and Cleaning

Checking for abnormalities:

#any null values?
is.null(df)

## [1] FALSE

#any duplicacy?
dfnew <- df %>% distinct() ##yes, duplicates were removed

We see that there is an outlier in the Sales feature, an unusual hike. Let’s replace it with the mean of sales.

maxSales <- max(dfnew$Sales)
dfnew$Sales <- replace(dfnew$Sales, dfnew$Sales==maxSales,mean(dfnew$Sales))

Removing country and Postal Codes feature

dfnew <- dfnew %>% select(-c(Country, Postal.Code))

Visualization

Let’s analyze patterns in our cleaned data

Sales vs Quantity

In the below graph, we see the following pattern that most of the sales have been triggered by the standard class of shipment mode.

ggplot(data = dfnew, aes(x = Quantity, y = Sales, fill = Ship.Mode) )+ geom_bar(stat = "identity")

Sales vs Profit

And hence, obviously we see more profits/loss have been availed from the standard shipment class. But, there are not higher range profits seen this feature.

ggplot(data = dfnew, aes(x = Sales, y = Profit, color = Ship.Mode)) + geom_point()

Sales vs Discount

Let us see how Sales are affected if discounts are offered.

ggplot() + geom_point(data = dfnew, aes(x = Discount, y = Sales, color = Ship.Mode))

It is evident from the above graph that discounts attract more sales. But, discounts attract mostly the Standard Class shipment. Same day shipment mode receive the least disount offers.

Profits vs Discount

Let’s see whether profits have been triggered if discounts have been redeemed.

ggplot() + geom_bar(data = dfnew, aes(x = Discount, y = Profit, fill = Ship.Mode), stat = "identity")

Yes, we see clearly, the more discounts have been offered and redeemed, the lesser profits the segments have achieved. Products with no discounts show high range of profits but as the discount range increases, we only see more and more loss with hardly any profit.

Let us see if this is the case with other segments

ggplot() + geom_bar(data = dfnew, aes(x = Sub.Category, y = Profit, fill = Region), stat = "identity") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We see that more losses have been incurred by the Binders industry mainly in the Central region and Machines and * Tables * industry.

Now,

ggplot() + geom_bar(data = dfnew, aes(x = Category, y = Sales, fill = Region), stat = "identity") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

More Sales have been incurred by the technology category, then Furniture and office supplies. Mostly sales have been made from the West and East regions

ggplot() + geom_bar(data = dfnew, aes(x = Category, y = Profit, fill = Region), stat = "identity") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

The furniture category incurrs more losses than losses in the technology and Office Supplies category.

Since, Sales also vary from low to high in this category so is are profits.

ggplot() + geom_point(data = dfnew, aes(x = Sales, y = Profit, color = Category))

We have now witnessed from the above graphs that the Sales to Profit ratio is same in every category, no matter how they are clubbed.

Conclusion

Outro

So, this is it. We chose a simple, small dataset to start with and learnt how to manipulate data using a few simple functions.

Stay tuned for more tutorials!
Thank You!

SuperStore Retail Analysis
@ Shrishti Vaish · Monday, Dec 21, 2020 · 6 minute read · Update at Dec 21, 2020