Let’s analyze the marketing data set provided by an user of Kaggle.
setwd("C:/Users/Daniel Gutierrez/Desktop/R Practice/Marketing Data Kaggle")
marketing_data <- readr::read_csv("marketing_data.csv")
Simple summary statistics of the data sets to see what variables we have
summary(marketing_data)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 Length:2240 Length:2240
## 1st Qu.: 2828 1st Qu.:1959 Class :character Class :character
## Median : 5458 Median :1970 Mode :character Mode :character
## Mean : 5592 Mean :1969
## 3rd Qu.: 8428 3rd Qu.:1977
## Max. :11191 Max. :1996
## Income Kidhome Teenhome Dt_Customer
## Length:2240 Min. :0.0000 Min. :0.0000 Length:2240
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Mode :character Median :0.0000 Median :0.0000 Mode :character
## Mean :0.4442 Mean :0.5062
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2.0000 Max. :2.0000
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## AcceptedCmp2 Response Complain Country
## Min. :0.00000 Min. :0.0000 Min. :0.000000 Length:2240
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000000 Class :character
## Median :0.00000 Median :0.0000 Median :0.000000 Mode :character
## Mean :0.01339 Mean :0.1491 Mean :0.009375
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.0000 Max. :1.000000
str(marketing_data)
## spec_tbl_df [2,240 x 28] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : num [1:2240] 1826 1 10476 1386 5371 ...
## $ Year_Birth : num [1:2240] 1970 1961 1958 1967 1989 ...
## $ Education : chr [1:2240] "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr [1:2240] "Divorced" "Single" "Married" "Together" ...
## $ Income : chr [1:2240] "$84,835.00" "$57,091.00" "$67,267.00" "$32,474.00" ...
## $ Kidhome : num [1:2240] 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : num [1:2240] 0 0 1 1 0 0 0 1 1 1 ...
## $ Dt_Customer : chr [1:2240] "6/16/14" "6/15/14" "5/13/14" "5/11/14" ...
## $ Recency : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : num [1:2240] 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : num [1:2240] 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : num [1:2240] 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : num [1:2240] 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : num [1:2240] 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : num [1:2240] 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : num [1:2240] 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : num [1:2240] 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: num [1:2240] 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : num [1:2240] 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : num [1:2240] 1 5 2 7 7 2 6 5 4 4 ...
## $ AcceptedCmp3 : num [1:2240] 0 0 0 0 1 0 1 0 0 0 ...
## $ AcceptedCmp4 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : num [1:2240] 0 1 0 0 0 0 0 0 0 0 ...
## $ Response : num [1:2240] 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ Country : chr [1:2240] "SP" "CA" "US" "AUS" ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. Year_Birth = col_double(),
## .. Education = col_character(),
## .. Marital_Status = col_character(),
## .. Income = col_character(),
## .. Kidhome = col_double(),
## .. Teenhome = col_double(),
## .. Dt_Customer = col_character(),
## .. Recency = col_double(),
## .. MntWines = col_double(),
## .. MntFruits = col_double(),
## .. MntMeatProducts = col_double(),
## .. MntFishProducts = col_double(),
## .. MntSweetProducts = col_double(),
## .. MntGoldProds = col_double(),
## .. NumDealsPurchases = col_double(),
## .. NumWebPurchases = col_double(),
## .. NumCatalogPurchases = col_double(),
## .. NumStorePurchases = col_double(),
## .. NumWebVisitsMonth = col_double(),
## .. AcceptedCmp3 = col_double(),
## .. AcceptedCmp4 = col_double(),
## .. AcceptedCmp5 = col_double(),
## .. AcceptedCmp1 = col_double(),
## .. AcceptedCmp2 = col_double(),
## .. Response = col_double(),
## .. Complain = col_double(),
## .. Country = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
ID: Customer’s unique identifier
Year_Birth: Customer’s birth year
Education: Customer’s education level
Marital_Status: Customer’s marital status
Income: Customer’s yearly household income
Kidhome: Number of children in customer’s household
Teenhome: Number of teenagers in customer’s household
Dt_Customer: Date of customer’s enrollment with the company
Recency: Number of days since customer’s last purchase
MntWines: Amount spent on wine in the last 2 years
MntFruits: Amount spent on wine in the last 2 years
MntMeatProducts: Amount spent on meat in the last 2 years
MntFishProducts: Amount spent on fish in the last 2 years
MntSweetProducts: Amount spent on sweets in the last 2 years
MntGoldProds: Amount spent on gold in the last 2 years
NumDealsPurchase: Number of purchases made with a discount
NumWebPurchases: Number of purchases made through the company’s web site
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s web site in the last month
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
Complain: 1 if customer complained in the last 2 years, 0 otherwise
Country: Customer’s location
Note that we have a variable, “year_birth”, for the costumers which probably is hard to interpret in a model. We would have to build an age variable. Also, note that the income variable is a string variable. It includes the dollar symbol, “$”, before the yearly income. Dt_Customer is a supposed to be a date.
We have to transform the income variable into a numerical variable. Marital status, education, accepted* and response as factors. Maybe country as a factor too. Also an age variable: 2021 - year_birth.
marketing <- marketing_data %>%
mutate(Income = as.numeric(gsub("[^0-9.-]", "",Income)),
Dt_Customer = as.Date(Dt_Customer, "%m/%d/%y"),
Education = as.factor(Education),
Marital_Status = as.factor(Marital_Status),
Country = as.factor(Country),
AcceptedCmp1 = as.factor(AcceptedCmp1),
AcceptedCmp2 = as.factor(AcceptedCmp2),
AcceptedCmp3 = as.factor(AcceptedCmp3),
AcceptedCmp4 = as.factor(AcceptedCmp4),
AcceptedCmp5 = as.factor(AcceptedCmp5),
Response = as.factor(Response),
Age = 2021 - Year_Birth)
marketing
## # A tibble: 2,240 x 29
## ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer
## <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <date>
## 1 1826 1970 Graduation Divorced 84835 0 0 2014-06-16
## 2 1 1961 Graduation Single 57091 0 0 2014-06-15
## 3 10476 1958 Graduation Married 67267 0 1 2014-05-13
## 4 1386 1967 Graduation Together 32474 1 1 2014-05-11
## 5 5371 1989 Graduation Single 21474 1 0 2014-04-08
## 6 7348 1958 PhD Single 71691 0 0 2014-03-17
## 7 4073 1954 2n Cycle Married 63564 0 0 2014-01-29
## 8 1991 1967 Graduation Together 44931 0 1 2014-01-18
## 9 4047 1954 PhD Married 65324 0 1 2014-01-11
## 10 9477 1954 PhD Married 65324 0 1 2014-01-11
## # ... with 2,230 more rows, and 21 more variables: Recency <dbl>,
## # MntWines <dbl>, MntFruits <dbl>, MntMeatProducts <dbl>,
## # MntFishProducts <dbl>, MntSweetProducts <dbl>, MntGoldProds <dbl>,
## # NumDealsPurchases <dbl>, NumWebPurchases <dbl>, NumCatalogPurchases <dbl>,
## # NumStorePurchases <dbl>, NumWebVisitsMonth <dbl>, AcceptedCmp3 <fct>,
## # AcceptedCmp4 <fct>, AcceptedCmp5 <fct>, AcceptedCmp1 <fct>,
## # AcceptedCmp2 <fct>, Response <fct>, Complain <dbl>, Country <fct>, ...
The raw the data set now looks better. But there are some NA’s. Let’s investigate them.
summary(marketing)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 2n Cycle : 203 Married :864
## 1st Qu.: 2828 1st Qu.:1959 Basic : 54 Together:580
## Median : 5458 Median :1970 Graduation:1127 Single :480
## Mean : 5592 Mean :1969 Master : 370 Divorced:232
## 3rd Qu.: 8428 3rd Qu.:1977 PhD : 486 Widow : 77
## Max. :11191 Max. :1996 Alone : 3
## (Other) : 4
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Min. :2012-07-30
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2013-01-16
## Median : 51382 Median :0.0000 Median :0.0000 Median :2013-07-08
## Mean : 52247 Mean :0.4442 Mean :0.5062 Mean :2013-07-10
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2013-12-30
## Max. :666666 Max. :2.0000 Max. :2.0000 Max. :2014-06-29
## NA's :24
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Response
## 0:2077 0:2073 0:2077 0:2096 0:2210 0:1906
## 1: 163 1: 167 1: 163 1: 144 1: 30 1: 334
##
##
##
##
##
## Complain Country Age
## Min. :0.000000 SP :1095 Min. : 25.00
## 1st Qu.:0.000000 SA : 337 1st Qu.: 44.00
## Median :0.000000 CA : 268 Median : 51.00
## Mean :0.009375 AUS : 160 Mean : 52.19
## 3rd Qu.:0.000000 IND : 148 3rd Qu.: 62.00
## Max. :1.000000 GER : 120 Max. :128.00
## (Other): 112
There are 24 NA’s for the income variable. We could substitute them by the mean of the income depending on their education. We could also delete them. Let’s try both approaches and compare the results.
marketing_imp <- marketing %>%
group_by(Education) %>%
mutate(Income = ifelse(is.na(Income), mean(Income, na.rm = TRUE), Income))
summary(marketing_imp)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 2n Cycle : 203 Married :864
## 1st Qu.: 2828 1st Qu.:1959 Basic : 54 Together:580
## Median : 5458 Median :1970 Graduation:1127 Single :480
## Mean : 5592 Mean :1969 Master : 370 Divorced:232
## 3rd Qu.: 8428 3rd Qu.:1977 PhD : 486 Widow : 77
## Max. :11191 Max. :1996 Alone : 3
## (Other) : 4
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Min. :2012-07-30
## 1st Qu.: 35539 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2013-01-16
## Median : 51610 Median :0.0000 Median :0.0000 Median :2013-07-08
## Mean : 52254 Mean :0.4442 Mean :0.5062 Mean :2013-07-10
## 3rd Qu.: 68290 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2013-12-30
## Max. :666666 Max. :2.0000 Max. :2.0000 Max. :2014-06-29
##
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Response
## 0:2077 0:2073 0:2077 0:2096 0:2210 0:1906
## 1: 163 1: 167 1: 163 1: 144 1: 30 1: 334
##
##
##
##
##
## Complain Country Age
## Min. :0.000000 SP :1095 Min. : 25.00
## 1st Qu.:0.000000 SA : 337 1st Qu.: 44.00
## Median :0.000000 CA : 268 Median : 51.00
## Mean :0.009375 AUS : 160 Mean : 52.19
## 3rd Qu.:0.000000 IND : 148 3rd Qu.: 62.00
## Max. :1.000000 GER : 120 Max. :128.00
## (Other): 112
No more NA’s. Now we can start an uni-variate analysis to check for implausible values and outliers. Age and income are already causing me suspicious. Maximum value of Age is 128 years old with a mean of 52 and a 75% quantile of 62, probably this person of 128 is an outlier or just a mistake when writing the dataset. For income the maximum value is $666,666 per year, with a mean of $52,254 and 75% of the participants earn less than $68,290. The maximum value is about 10 times higher. The number itself its already suspicious as it is a sequence of six sixes. It is very likely that this is not a real person. But we analyze it.
hist_age <- marketing_imp %>%
ggplot(aes(Age))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Age Histogram")+
ylab(NULL)
hist_age
sum(marketing_imp$Age > 100)
## [1] 3
Three participants older than 100 years old. Since this is a digital commerce, I will drop these observations. It is unlikely that the people in this age buy online.
hist_income <- marketing_imp %>%
ggplot(aes(Income))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Income Histogram")+
ylab(NULL)
require(scales)
hist_income + scale_x_continuous(labels = comma)
sum(marketing_imp$Income > 100000 )
## [1] 13
There are 13 participants that earn more than $100,000 dollars per year. I will also drop these observations as they might induce noise to the analyses.
marketing_ready <- marketing_imp %>%
filter(Income < 100000 & Age < 100)
summary(marketing_ready)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1940 2n Cycle : 201 Married :861
## 1st Qu.: 2828 1st Qu.:1959 Basic : 54 Together:573
## Median : 5458 Median :1970 Graduation:1121 Single :477
## Mean : 5589 Mean :1969 Master : 369 Divorced:229
## 3rd Qu.: 8422 3rd Qu.:1977 PhD : 479 Widow : 77
## Max. :11191 Max. :1996 Alone : 3
## (Other) : 4
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Min. :2012-07-30
## 1st Qu.:35376 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2013-01-16
## Median :51412 Median :0.0000 Median :0.0000 Median :2013-07-09
## Mean :51515 Mean :0.4456 Mean :0.5085 Mean :2013-07-10
## 3rd Qu.:67911 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2013-12-30
## Max. :98777 Max. :2.0000 Max. :2.0000 Max. :2014-06-29
##
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 1.75 1st Qu.: 16.0
## Median :49.50 Median : 176.0 Median : 8.00 Median : 67.0
## Mean :49.14 Mean : 304.5 Mean : 26.23 Mean : 165.3
## 3rd Qu.:74.00 3rd Qu.: 504.2 3rd Qu.: 33.00 3rd Qu.: 231.2
## Max. :99.00 Max. :1493.0 Max. :199.00 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.57 Mean : 26.91 Mean : 43.95 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 4.085 Mean : 2.627 Mean : 5.804 Mean : 5.341
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.000 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Response
## 0:2061 0:2060 0:2066 0:2084 0:2195 0:1894
## 1: 163 1: 164 1: 158 1: 140 1: 29 1: 330
##
##
##
##
##
## Complain Country Age
## Min. :0.000000 SP :1089 Min. :25.00
## 1st Qu.:0.000000 SA : 332 1st Qu.:44.00
## Median :0.000000 CA : 266 Median :51.00
## Mean :0.008993 AUS : 160 Mean :52.11
## 3rd Qu.:0.000000 IND : 146 3rd Qu.:62.00
## Max. :1.000000 GER : 120 Max. :81.00
## (Other): 111
marketing_ready %>%
ggplot(aes(Age))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Age Histogram")+
ylab(NULL)
marketing_ready %>%
ggplot(aes(Income))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Income Histogram")+
ylab(NULL) + scale_x_continuous(labels = comma)
Both variables look better now. Now lets look at the distribution of the remaining continous variables and categorical variables.
hist_wines <- marketing_ready %>%
ggplot(aes(MntWines))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Wine Histogram")+
ylab(NULL) + scale_x_continuous(labels = comma)
hist_fruits <- marketing_ready %>%
ggplot(aes(MntFruits))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Fruits Histogram")+
ylab(NULL) + scale_x_continuous(labels = comma)
hist_MeatProducts <- marketing_ready %>%
ggplot(aes(MntMeatProducts))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Meat Products Histogram")+
ylab(NULL) + scale_x_continuous(labels = comma)
hist_FishProducts <- marketing_ready %>%
ggplot(aes(MntFishProducts))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Fish Products Histogram")+
ylab(NULL) + scale_x_continuous(labels = comma)
hist_SweetProducts <- marketing_ready %>%
ggplot(aes(MntSweetProducts))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Sweet Products Histogram")+
ylab(NULL) + scale_x_continuous(labels = comma)
hist_GoldProds <- marketing_ready %>%
ggplot(aes(MntGoldProds))+
geom_histogram(bins = 50, fill = "#24B0B7", alpha = 0.8)+
labs(title = "Gold Products Histogram")+
ylab(NULL) + scale_x_continuous(labels = comma)
grid.arrange(hist_wines, hist_fruits, hist_MeatProducts, hist_FishProducts,
hist_SweetProducts, hist_GoldProds, ncol=3)
There are a lot of zeros for the distribution of the amount spend on each of the products. A few outliers in the expenditure behavior of the costumers. For example, the meat products histogram shows a point at the right of the distribution with more than 1,500 USD spent in the last two years. The sweet products histogram shows a similar behavior. The gold products are more sold, they have at least higher number of observations with one dollar spent.
par(mfrow = c(3,3))
plot(marketing_ready$Education, main = "Education")
plot(marketing_ready$Marital_Status, main = "Marital Status")
plot(marketing_ready$Country, main = "Country")
plot(marketing_ready$AcceptedCmp1, main = "Offer 1st Campaign Success")
plot(marketing_ready$AcceptedCmp2, main = "Offer 2nd Campaign Success")
plot(marketing_ready$AcceptedCmp3, main = "Offer 3rd Campaign Success")
plot(marketing_ready$AcceptedCmp4, main = "Offer 4th Campaign Success")
plot(marketing_ready$AcceptedCmp1, main = "Offer 5th Campaign Success")
plot(marketing_ready$Response, main = "Last offer Success")
table(marketing_ready$Marital_Status)
##
## Absurd Alone Divorced Married Single Together Widow YOLO
## 2 3 229 861 477 573 77 2
Probably would make sense to merge Alone with the Single category and drop “Absurd” and “YOLO”.
marketing_ready$Marital_Status <- as.factor(ifelse(as.character(marketing_ready$Marital_Status) == "Alone", "Single",as.character(marketing_ready$Marital_Status)))
table(marketing_ready$Marital_Status)
##
## Absurd Divorced Married Single Together Widow YOLO
## 2 229 861 480 573 77 2
marketing_ready <- marketing_ready %>%
filter(Marital_Status != "Absurd" & Marital_Status != "YOLO") %>%
mutate(Marital_Status = factor(Marital_Status))
summary(marketing_ready)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1940 2n Cycle : 201 Divorced:229
## 1st Qu.: 2828 1st Qu.:1959 Basic : 54 Married :861
## Median : 5458 Median :1970 Graduation:1120 Single :480
## Mean : 5588 Mean :1969 Master : 368 Together:573
## 3rd Qu.: 8422 3rd Qu.:1977 PhD : 477 Widow : 77
## Max. :11191 Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Min. :2012-07-30
## 1st Qu.:35336 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2013-01-16
## Median :51412 Median :0.0000 Median :0.0000 Median :2013-07-09
## Mean :51499 Mean :0.4464 Mean :0.5086 Mean :2013-07-10
## 3rd Qu.:67911 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2013-12-30
## Max. :98777 Max. :2.0000 Max. :2.0000 Max. :2014-06-29
##
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 1.0 1st Qu.: 16.0
## Median :50.00 Median : 174.5 Median : 8.0 Median : 67.0
## Mean :49.18 Mean : 304.4 Mean : 26.2 Mean : 165.3
## 3rd Qu.:74.00 3rd Qu.: 505.0 3rd Qu.: 33.0 3rd Qu.: 231.2
## Max. :99.00 Max. :1493.0 Max. :199.0 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.45 Mean : 26.92 Mean : 43.81 Mean : 2.323
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 4.082 Mean : 2.624 Mean : 5.803 Mean : 5.342
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.000 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Response
## 0:2057 0:2056 0:2063 0:2081 0:2191 0:1892
## 1: 163 1: 164 1: 157 1: 139 1: 29 1: 328
##
##
##
##
##
## Complain Country Age
## Min. :0.000000 SP :1089 Min. :25.00
## 1st Qu.:0.000000 SA : 332 1st Qu.:44.00
## Median :0.000000 CA : 264 Median :51.00
## Mean :0.009009 AUS : 159 Mean :52.12
## 3rd Qu.:0.000000 IND : 145 3rd Qu.:62.00
## Max. :1.000000 GER : 120 Max. :81.00
## (Other): 111
We could also create a variable for total purchases in the last two years
marketing_ready <- marketing_ready %>%
mutate(total_purch = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds)
marketing_ready %>%
ggplot(aes(total_purch)) +
geom_histogram()
summary(marketing_ready)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1940 2n Cycle : 201 Divorced:229
## 1st Qu.: 2828 1st Qu.:1959 Basic : 54 Married :861
## Median : 5458 Median :1970 Graduation:1120 Single :480
## Mean : 5588 Mean :1969 Master : 368 Together:573
## 3rd Qu.: 8422 3rd Qu.:1977 PhD : 477 Widow : 77
## Max. :11191 Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Min. :2012-07-30
## 1st Qu.:35336 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2013-01-16
## Median :51412 Median :0.0000 Median :0.0000 Median :2013-07-09
## Mean :51499 Mean :0.4464 Mean :0.5086 Mean :2013-07-10
## 3rd Qu.:67911 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2013-12-30
## Max. :98777 Max. :2.0000 Max. :2.0000 Max. :2014-06-29
##
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 1.0 1st Qu.: 16.0
## Median :50.00 Median : 174.5 Median : 8.0 Median : 67.0
## Mean :49.18 Mean : 304.4 Mean : 26.2 Mean : 165.3
## 3rd Qu.:74.00 3rd Qu.: 505.0 3rd Qu.: 33.0 3rd Qu.: 231.2
## Max. :99.00 Max. :1493.0 Max. :199.0 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.45 Mean : 26.92 Mean : 43.81 Mean : 2.323
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 4.082 Mean : 2.624 Mean : 5.803 Mean : 5.342
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.000 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Response
## 0:2057 0:2056 0:2063 0:2081 0:2191 0:1892
## 1: 163 1: 164 1: 157 1: 139 1: 29 1: 328
##
##
##
##
##
## Complain Country Age total_purch
## Min. :0.000000 SP :1089 Min. :25.00 Min. : 5.0
## 1st Qu.:0.000000 SA : 332 1st Qu.:44.00 1st Qu.: 69.0
## Median :0.000000 CA : 264 Median :51.00 Median : 395.5
## Mean :0.009009 AUS : 159 Mean :52.12 Mean : 604.1
## 3rd Qu.:0.000000 IND : 145 3rd Qu.:62.00 3rd Qu.:1039.2
## Max. :1.000000 GER : 120 Max. :81.00 Max. :2525.0
## (Other): 111
marketing_ready %>%
ggplot(aes(Income, log(total_purch)))+
geom_point()+
geom_smooth()
There is a non-linear relationship between amount spent and income.
To answer question 1 we can regress the number of store purchases on age, education, marital status, income, number of kids and teenagers at home and country. Also a second regression could include the offers the costumer took. A third regression could also include an indicator variable for each type of product to identify which products are acquired in a store. It is interesting to investigate a count model that estimates the probabilities of the costumers purchasing in a store.
marketing_ready <- marketing_ready %>%
mutate(wine = factor(ifelse(MntWines > 0, 1, 0)),
fruits = factor(ifelse(MntFruits > 0, 1, 0)),
meat = factor(ifelse(MntMeatProducts > 0, 1, 0)),
fish = factor(ifelse(MntFishProducts > 0, 1, 0)),
sweet = factor(ifelse(MntSweetProducts > 0, 1, 0)),
gold = factor(ifelse(MntGoldProds > 0, 1, 0)),
store_purchase = factor(ifelse(NumStorePurchases > 0, 1,0)))
store_ols1 <- lm(NumStorePurchases ~ Age + Education + Marital_Status + Income + Kidhome + Teenhome + Country, data = marketing_ready)
store_ols2 <- lm(NumStorePurchases ~ Age + Education + Marital_Status + Income + Kidhome + Teenhome + Country + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + Response, data = marketing_ready)
store_ols3 <- lm(NumStorePurchases ~ Age + Education + Marital_Status + Income + Kidhome + Teenhome + Country + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + Response + wine + fruits + meat + fish + sweet + gold, data = marketing_ready)
store_negbin1 <- glm(NumStorePurchases ~ Age + Education + Marital_Status + Income + Kidhome + Teenhome + Country + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + Response + wine + fruits + meat + fish + sweet + gold, family = negative.binomial(theta = 1), data = marketing_ready)
stargazer(store_ols1, store_ols2, store_ols3, store_negbin1, type = "text", align = TRUE, no.space = TRUE)
##
## ===============================================================================================================================
## Dependent variable:
## --------------------------------------------------------------------------------------------------------
## NumStorePurchases
## OLS glm: Negative Binomial(1)
## link = log
## (1) (2) (3) (4)
## -------------------------------------------------------------------------------------------------------------------------------
## Age -0.011** -0.012** -0.010** -0.001*
## (0.005) (0.005) (0.005) (0.001)
## EducationBasic 0.048 0.100 0.062 -0.032
## (0.361) (0.357) (0.358) (0.061)
## EducationGraduation -0.127 -0.122 -0.065 -0.016
## (0.177) (0.175) (0.174) (0.029)
## EducationMaster -0.075 -0.062 0.103 0.022
## (0.204) (0.202) (0.202) (0.033)
## EducationPhD -0.168 -0.132 0.161 0.033
## (0.197) (0.195) (0.200) (0.033)
## Marital_StatusMarried 0.165 0.136 0.163 0.047*
## (0.172) (0.171) (0.169) (0.028)
## Marital_StatusSingle -0.009 -0.013 0.025 0.011
## (0.187) (0.184) (0.183) (0.030)
## Marital_StatusTogether 0.043 -0.014 0.015 0.016
## (0.180) (0.178) (0.177) (0.029)
## Marital_StatusWidow 0.065 0.090 0.132 0.063
## (0.305) (0.301) (0.299) (0.048)
## Income 0.0001*** 0.0001*** 0.0001*** 0.00002***
## (0.00000) (0.00000) (0.00000) (0.00000)
## Kidhome -1.226*** -1.196*** -1.089*** -0.205***
## (0.109) (0.107) (0.108) (0.018)
## Teenhome 0.182* 0.001 0.110 0.048***
## (0.097) (0.099) (0.100) (0.016)
## CountryCA 0.162 0.152 0.114 0.016
## (0.231) (0.228) (0.226) (0.037)
## CountryGER 0.240 0.211 0.228 0.031
## (0.279) (0.275) (0.273) (0.044)
## CountryIND 0.152 0.106 0.151 0.021
## (0.265) (0.262) (0.259) (0.043)
## CountryME -0.338 -0.015 -0.081 0.081
## (1.340) (1.326) (1.315) (0.210)
## CountrySA 0.327 0.302 0.288 0.049
## (0.222) (0.219) (0.217) (0.036)
## CountrySP 0.261 0.278 0.273 0.037
## (0.195) (0.193) (0.192) (0.031)
## CountryUS 0.426 0.401 0.346 0.054
## (0.287) (0.283) (0.281) (0.046)
## AcceptedCmp11 -0.493** -0.446** -0.071**
## (0.227) (0.226) (0.036)
## AcceptedCmp21 0.652 0.871* 0.182**
## (0.451) (0.448) (0.071)
## AcceptedCmp31 -0.472** -0.416** -0.114***
## (0.194) (0.193) (0.032)
## AcceptedCmp41 0.384* 0.573*** 0.116***
## (0.205) (0.205) (0.033)
## AcceptedCmp51 -0.886*** -0.877*** -0.178***
## (0.230) (0.228) (0.036)
## Response1 -0.527*** -0.612*** -0.076***
## (0.154) (0.153) (0.025)
## wine1 0.326 0.116
## (0.673) (0.118)
## fruits1 0.180 0.043*
## (0.144) (0.024)
## meat1 0.248 0.203
## (2.348) (0.426)
## fish1 0.336** 0.075***
## (0.147) (0.024)
## sweet1 0.650*** 0.129***
## (0.142) (0.023)
## gold1 0.425 0.090*
## (0.297) (0.049)
## Constant 1.836*** 1.776*** -0.339 0.215
## (0.404) (0.402) (2.285) (0.415)
## -------------------------------------------------------------------------------------------------------------------------------
## Observations 2,220 2,220 2,220 2,220
## R2 0.502 0.516 0.527
## Adjusted R2 0.497 0.510 0.520
## Log Likelihood -6,128.878
## Akaike Inf. Crit. 12,321.760
## Residual Std. Error 2.293 (df = 2200) 2.263 (df = 2194) 2.241 (df = 2188)
## F Statistic 116.601*** (df = 19; 2200) 93.464*** (df = 25; 2194) 78.542*** (df = 31; 2188)
## ===============================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The OLS models hint us that the factors driven store purchases are kids and teenagers at home, income, the offer accepted at the moment of the purchase and the product that the costumer buys. The count model confirms these findings. It seems that the second and fourth offer affects positively purchasing in stores. Fish and sweets are the products that also contributes to more purchases. Income follows the same pattern, and surprisingly age contributes negatively to store purchases.
To answer question 2, lets plot a graph to observe how the country contribution to total purchases.
marketing_ready %>%
ggplot(aes(Country, total_purch, fill = Country)) +
geom_col() +
scale_y_continuous() +
coord_flip() +
scale_fill_brewer(palette = "Spectral")
By comparing only countries total purchases, Spain is doing better than any other country. However, Spain is also the country with more costumers.
summary(marketing_ready$Country)
## AUS CA GER IND ME SA SP US
## 159 264 120 145 3 332 1089 108
Lets use a boxplot to investigate this,
marketing_ready %>%
ggplot(aes(total_purch, Country, fill = Country)) +
geom_boxplot(outlier.alpha = 0.5) +
scale_x_continuous() +
scale_fill_brewer(palette = "Spectral")
Now we can see that the median of the US is higher than Spain. Mexico is by far the country with more total purchases.
marketing_ready %>%
group_by(Country) %>%
mutate(mean_country = mean(total_purch)) %>%
ggplot(aes(Country, mean_country, colour = Country)) +
geom_point() +
scale_colour_brewer(palette = "Spectral")
In average, the US has a higher total purchases value than SP, IND and AUS but GER and CA are slightly better than US, and ME is better. We can run a regression to see these effects.
marketing_country <- marketing_ready %>%
mutate(Country = relevel(Country, ref = "US"))
totalpurch1 <- lm(log(total_purch) ~ Age + Income + I(Income^2) + Country + Kidhome + Teenhome + AcceptedCmp1 + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + Response + Complain, data = marketing_country)
stargazer(totalpurch1, type = "text", aling = TRUE, no.space = TRUE)
##
## ===============================================
## Dependent variable:
## ---------------------------
## log(total_purch)
## -----------------------------------------------
## Age -0.003*
## (0.002)
## Income 0.0001***
## (0.00000)
## I(Income2) -0.000***
## (0.000)
## CountryAUS -0.218**
## (0.094)
## CountryCA -0.141
## (0.086)
## CountryGER -0.278***
## (0.100)
## CountryIND -0.096
## (0.096)
## CountryME 0.330
## (0.441)
## CountrySA -0.117
## (0.083)
## CountrySP -0.169**
## (0.076)
## Kidhome -0.661***
## (0.036)
## Teenhome -0.138***
## (0.034)
## AcceptedCmp11 -0.051
## (0.076)
## AcceptedCmp21 -0.067
## (0.149)
## AcceptedCmp31 0.160**
## (0.064)
## AcceptedCmp41 0.277***
## (0.068)
## AcceptedCmp51 -0.232***
## (0.079)
## Response1 0.307***
## (0.050)
## Complain -0.127
## (0.169)
## Constant 3.252***
## (0.142)
## -----------------------------------------------
## Observations 2,220
## R2 0.744
## Adjusted R2 0.742
## Residual Std. Error 0.751 (df = 2200)
## F Statistic 337.264*** (df = 19; 2200)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
##
## ====
## TRUE
## ----
What we have found with the regression analysis is that average total purchases in AUS, GER and SP are lower than in the US. In average, AUS total purchases are 21.8% lower compared to the US. Same pattern for GER and SP, 27.8% and 16.9% lower.
To answer question, we create an indicator variable that equals 1 whether the amount spent on gold is higher than the average or not. Then regress the number of store purchases on this indicator variable. The mean of the amount spend on gold is 43.81.
summary(marketing_ready$MntGoldProds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 24.00 43.81 56.00 362.00
marketing_ready <- marketing_ready %>%
mutate(gold_above = as.factor(ifelse(MntGoldProds > 45, 1, 0)))
summary(marketing_ready$gold_above)
## 0 1
## 1549 671
above1 <- lm(NumStorePurchases ~ Age + Education + Marital_Status + Income + Kidhome + Teenhome + Country + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + Response + gold_above, data = marketing_ready)
stargazer(above1, type = "text", no.space = TRUE, align = TRUE)
##
## ==================================================
## Dependent variable:
## ---------------------------
## NumStorePurchases
## --------------------------------------------------
## Age -0.011**
## (0.005)
## EducationBasic 0.071
## (0.352)
## EducationGraduation -0.138
## (0.173)
## EducationMaster -0.034
## (0.199)
## EducationPhD 0.026
## (0.194)
## Marital_StatusMarried 0.137
## (0.169)
## Marital_StatusSingle 0.017
## (0.182)
## Marital_StatusTogether -0.008
## (0.176)
## Marital_StatusWidow -0.011
## (0.298)
## Income 0.0001***
## (0.00000)
## Kidhome -1.081***
## (0.107)
## Teenhome 0.002
## (0.098)
## CountryCA 0.165
## (0.225)
## CountryGER 0.187
## (0.272)
## CountryIND 0.077
## (0.259)
## CountryME 0.097
## (1.310)
## CountrySA 0.307
## (0.216)
## CountrySP 0.289
## (0.191)
## CountryUS 0.438
## (0.280)
## AcceptedCmp11 -0.447**
## (0.225)
## AcceptedCmp21 0.546
## (0.446)
## AcceptedCmp31 -0.582***
## (0.192)
## AcceptedCmp41 0.506**
## (0.203)
## AcceptedCmp51 -0.814***
## (0.227)
## Response1 -0.583***
## (0.153)
## gold_above1 0.878***
## (0.118)
## Constant 1.753***
## (0.397)
## --------------------------------------------------
## Observations 2,220
## R2 0.528
## Adjusted R2 0.522
## Residual Std. Error 2.236 (df = 2193)
## F Statistic 94.207*** (df = 26; 2193)
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The result of the regression confirms the supervisor’s hypothesis. Among gold buyers, those who buy more than the average go more often to the store.
Lets now analyze question 4. To investigate that, I will run a regression interacting marital status and education. To answer it correctly, the reference category for marital status is “married” and for education “PhD”. In this sense, we can compare the categories with the other ones.
marketing_ready <- marketing_ready %>%
mutate(Marital_Status = relevel(Marital_Status, ref = "Married"),
Education = relevel(Education, ref = "PhD"))
inter1 <- lm(MntFishProducts ~ 0 + Age + Education + Marital_Status + Education*Marital_Status + Income + I(Income^2) + Kidhome + Teenhome + Country + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + Response, data = marketing_ready)
stargazer(inter1, type = "text", no.space = TRUE, align = TRUE)
##
## ======================================================================
## Dependent variable:
## ---------------------------
## MntFishProducts
## ----------------------------------------------------------------------
## Age 0.039
## (0.086)
## EducationPhD -6.957
## (8.620)
## Education2n Cycle 13.651
## (8.828)
## EducationBasic 25.293**
## (11.293)
## EducationGraduation 11.571
## (7.983)
## EducationMaster -4.701
## (8.700)
## Marital_StatusDivorced 12.934**
## (6.465)
## Marital_StatusSingle -1.338
## (5.135)
## Marital_StatusTogether -3.500
## (4.877)
## Marital_StatusWidow 3.553
## (8.950)
## Income -0.00002
## (0.0002)
## I(Income2) 0.00000***
## (0.000)
## Kidhome -12.360***
## (1.956)
## Teenhome -18.226***
## (1.881)
## CountryCA -0.148
## (4.149)
## CountryGER -0.980
## (5.000)
## CountryIND 0.713
## (4.754)
## CountryME 32.870
## (24.019)
## CountrySA 3.831
## (3.979)
## CountrySP 0.773
## (3.512)
## CountryUS 1.292
## (5.138)
## AcceptedCmp11 15.009***
## (4.168)
## AcceptedCmp21 -11.646
## (8.192)
## AcceptedCmp31 -0.569
## (3.536)
## AcceptedCmp41 -19.617***
## (3.710)
## AcceptedCmp51 -28.824***
## (4.359)
## Response1 1.474
## (2.808)
## Education2n Cycle:Marital_StatusDivorced -18.771
## (11.813)
## EducationBasic:Marital_StatusDivorced -18.634
## (42.554)
## EducationGraduation:Marital_StatusDivorced -20.065***
## (7.727)
## EducationMaster:Marital_StatusDivorced -8.141
## (9.966)
## Education2n Cycle:Marital_StatusSingle -8.880
## (9.695)
## EducationBasic:Marital_StatusSingle -8.365
## (14.306)
## EducationGraduation:Marital_StatusSingle 1.416
## (6.064)
## EducationMaster:Marital_StatusSingle 9.494
## (7.798)
## Education2n Cycle:Marital_StatusTogether 28.614***
## (8.613)
## EducationBasic:Marital_StatusTogether -13.129
## (15.114)
## EducationGraduation:Marital_StatusTogether 5.623
## (5.803)
## EducationMaster:Marital_StatusTogether 10.993
## (7.219)
## Education2n Cycle:Marital_StatusWidow 51.207**
## (20.899)
## EducationBasic:Marital_StatusWidow -4.079
## (43.011)
## EducationGraduation:Marital_StatusWidow -1.078
## (11.450)
## EducationMaster:Marital_StatusWidow 39.894***
## (15.243)
## ----------------------------------------------------------------------
## Observations 2,220
## R2 0.624
## Adjusted R2 0.616
## Residual Std. Error 40.930 (df = 2177)
## F Statistic 83.899*** (df = 43; 2177)
## ======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The results of the regression tell us that married PhD candidates only buy more fish than graduated divorced participants. Widow participants tend to spend more in fish products. In general, divorced participants buy more fish too. Kids and teens at home decrease the amount spent in fish. Those who accepted offer campaign 2 buy more fish. Accepting campaigns 4 and5 spend less in fish than those who did not accept these campaigns.
To answer question 5, we can explore graphically the different campaigns.
par(mfrow = c(2,3))
plot(marketing_ready$AcceptedCmp1 ~ marketing_ready$Country, main = "Campaing 1", ylab = "Accepted Offer", xlab = "Country")
plot(marketing_ready$AcceptedCmp2 ~ marketing_ready$Country, main = "Campaing 2", ylab = NULL, xlab = "Country")
title(ylab = NULL)
plot(marketing_ready$AcceptedCmp3 ~ marketing_ready$Country, main = "Campaing 3", ylab = NULL, xlab = "Country")
plot(marketing_ready$AcceptedCmp4 ~ marketing_ready$Country, main = "Campaing 4", ylab = "Accepted Offer", xlab = "Country")
plot(marketing_ready$AcceptedCmp5 ~ marketing_ready$Country, main = "Campaing 5", ylab = NULL, xlab = "Country")
plot(marketing_ready$Response ~ marketing_ready$Country, main = "Last Campaign", ylab = NULL, xlab = "Country")
marketing_ready$offers <- ifelse(marketing_ready$AcceptedCmp1 == 1, "1st", 0)
marketing_ready$offers <- ifelse(marketing_ready$AcceptedCmp2 == 1, "2nd", marketing_ready$offers)
marketing_ready$offers <- ifelse(marketing_ready$AcceptedCmp3 == 1, "3rd", marketing_ready$offers)
marketing_ready$offers <- ifelse(marketing_ready$AcceptedCmp4 == 1, "4th", marketing_ready$offers)
marketing_ready$offers <- ifelse(marketing_ready$AcceptedCmp5 == 1, "5th", marketing_ready$offers)
marketing_ready$offers <- ifelse(marketing_ready$Response == 1, "Last", marketing_ready$offers)
marketing_ready$offers <- ifelse(marketing_ready$AcceptedCmp2 == 1, "2nd", marketing_ready$offers)
marketing_ready %>%
filter(offers != 0) %>%
ggplot(aes(offers)) +
geom_bar(fill = c("#24B0B7", "#39D97B", "#825DE3", "#DC8BE7", "#F31E1E", "#231985"), alpha = 0.8)
From the campaigns, the last one is the most successful one. Almost three times what others campaigns did. The company could apply another similar campaign to even increase the acceptance rate.
amountspent <- marketing %>%
dplyr::select(MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds) %>%
mutate(Wines = sum(MntWines),
Fruits = sum(MntFruits),
Meat = sum(MntMeatProducts),
Fish = sum(MntFishProducts),
Sweet = sum(MntSweetProducts),
Gold = sum(MntGoldProds)) %>%
dplyr::select(Wines, Fruits, Meat, Fish, Sweet, Gold)
amountspent <- unique(amountspent)
amountspent
## # A tibble: 1 x 6
## Wines Fruits Meat Fish Sweet Gold
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 680816 58917 373968 84057 60621 98609
amount <- data.frame(product = c("Wines", "Fruits", "Meat", "Fish", "Sweet", "Gold"), amount = c(680816, 58917, 373968, 84057, 60621, 98609))
amount %>%
ggplot(aes(amount, product)) +
geom_col(fill = c("#24B0B7", "#39D97B", "#825DE3", "#DC8BE7", "#F31E1E", "#231985"), alpha = 0.8)
Wines and meat are the channels that are sold the most. On the other hand, sweets and fruits are sold the least. The products that are not performing that well might be put in a bundle/pack to increase the quantity sold.
From this analysis, we can conclude that age, kids and teens at home, marketing campaign 1, 3, 5 and the last decrease the number of purchases in store. Whereas, income, campaigns 2 and 4, and products like meat, sweets and gold increase the purchases in store. For total purchases(i.e., the sum of amount spent on all goods) countries have no influence but marketing campaigns 3, 4 and last one affect significantly and positively the total amount spent per costumer in this company. Moreover, costumer who spent more than the average in gold also are more likely to buy in a store. However, married PhD costumer do not buy more fish. In fact, less educated consumers bought more. The 3rd, 4th and last campaign are the most successful ones as we found out before in the regression analysis. Creating bundles of products among those under-performing should increase the amount spent.