Deciphering Kickstarter Success: Employing Random Forest and Logistic Regression through Machine Learning

Introduction

In recent times, crowdfunding platforms have reshaped the landscape
for innovators, artists, and entrepreneurs in their pursuit of financial
support for their creative endeavors. Among these platforms, Kickstarter
stands out as a trailblazer, offering project creators a direct
connection to a global audience and the opportunity to fund their ideas.
Nevertheless, the path to success on Kickstarter is anything but
assured, with many campaigns falling short of their funding targets.

This study sets out to explore the intricate factors that influence
the fate of Kickstarter projects. By harnessing the capabilities of
machine learning, we’ve opted to analyze these factors using two widely
used algorithms: Random Forest and Logistic Regression. Random Forest’s
adeptness at handling intricate and interrelated data, coupled with
Logistic Regression’s proficiency in predicting binary outcomes, equips
us to delve deep into the complexities of project attributes, campaign
dynamics, and external variables that impact Kickstarter project
outcomes.

Through an in-depth examination of a diverse dataset spanning past
Kickstarter campaigns, our aim is to uncover patterns, trends, and
critical predictors that could enhance the likelihood of project
creators reaching their funding goals. This research not only
contributes significantly to the understanding of crowdfunding dynamics
but also aims to offer actionable insights for project creators looking
to launch successful campaigns on Kickstarter and similar platforms.
Ultimately, our findings seek to illuminate the nuances of crowdfunding
success, fostering an environment conducive to innovation and creativity
in the digital era.

library(tidyverse)
library(tidymodels)
library(ranger)
library(broom)

Loading the Data

mydt<-read.csv('ks-projects-201801.csv', header = T)
mydt$state<-as.factor(mydt$state)
levels(mydt$state)
## [1] "canceled"   "failed"     "live"       "successful" "suspended" 
## [6] "undefined"

Out of the six State levels (Failed, Successful, Cancelled, Live,
Suspended, and Undefined), our focus will be solely on projects
categorized as Successful and Failed.

mydt1<-mydt%>%filter(state == "failed" | state == "successful") %>% droplevels

Let’s examine the currency variable.

mydt1$currency<-as.factor(mydt1$currency)
round(prop.table(table(mydt1$currency)),2)
## 
##  AUD  CAD  CHF  DKK  EUR  GBP  HKD  JPY  MXN  NOK  NZD  SEK  SGD  USD 
## 0.02 0.04 0.00 0.00 0.04 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.79

Approximately 79% of the total amount is denominated in US dollars,
followed by 9% in British pounds, and 4% in euros. Together, USD and GBP
account for over 88% of the currency distribution. Therefore, we will
categorize the data into USD, GBP, and Other currencies.

## Transforming Currency variable into 4 levels
levels(mydt1$currency) <- 
  list(GBP="GBP", EUR="EUR",USD = "USD",
       other = c("MXN", "NZD", 
                                         "SEK","AUD","CAD","JPY", "HKD", "SGD", "CHF", "NOK", "DKK"))

levels(mydt1$currency)
## [1] "GBP"   "EUR"   "USD"   "other"

Examining both the project’s launch date and completion date, we can
create a new variable known as ‘duration.’ This variable reflects the
time taken for the project to either succeed or fail.

mydt1$deadline <- as.Date(mydt1$deadline)
mydt1$launched <- as.Date(mydt1$launched)
mydt2<-mydt1%>%mutate(Duration=deadline-launched)
mydt2$Duration<-as.numeric(mydt2$Duration)

Let’s examine the country variable.

mydt2$country<-as.factor(mydt2$country)
round(prop.table(table(mydt2$country)),2)
## 
##   AT   AU   BE   CA   CH   DE   DK   ES   FR   GB   HK   IE   IT   JP   LU   MX 
## 0.00 0.02 0.00 0.04 0.00 0.01 0.00 0.01 0.01 0.09 0.00 0.00 0.01 0.00 0.00 0.00 
## N,0"   NL   NO   NZ   SE   SG   US 
## 0.00 0.01 0.00 0.00 0.00 0.00 0.79

We assume that the impact of the Country variable on our model will
be similar to that of the currency variable, given its distribution.
Consequently, the country variable will not be incorporated into our
modeling. Now, let’s examine the percentage of missing values in our
data.

mean(!complete.cases(mydt2))
## [1] 0.0006331499
mydt2<-na.omit(mydt2)

chosen variables for analysis

data<-mydt2%>%select(main_category,currency,state,backers,usd_pledged_real,usd_goal_real,Duration)

Exploring tge Data Set

p1 <- ggplot(data, aes(x = log(backers), y = log(usd_pledged_real), color=state)) +
  geom_jitter()  +
  ylab("USD Pledged(log)")+ xlab("Backers(log)") +
  ggtitle("The Amount of USD Pledged vs Backers")
p1

There exists a correlation between the pledged amount in USD and the
number of backers for both failed and successful projects on
Kickstarter.

 p2<-ggplot(data, aes(log(usd_goal_real),  fill = state)) +
  geom_boxplot() +
  xlab("USD Goal(log))") + ylab("") +
  ggtitle("USD Goal for projects")+theme_bw()
p2

Based on the boxplot, unsuccessful Kickstarter projects tend to have
a higher average USD goal compared to the average USD pledged for
successful projects. Essentially, most failed projects seem to aim for a
higher USD goal.

p3 <- ggplot(data, aes(x = currency, fill = state)) +
  geom_bar(position = "dodge") +
  ylab("Number of projects") + xlab("") +
  ggtitle("Number of Projects Based on Currency")+theme_bw()
p3

The majority of both successful and failed Kickstarter projects were
funded in USD, with only a few funded in GBP, EUR, and other
currencies.

p4 <- ggplot(data, aes(x = main_category, fill = state)) +
  geom_bar(position = 'dodge')  +
  ylab("Number of projects") + xlab("") +scale_x_discrete(guide = guide_axis(angle = 90)) +
  ggtitle("Number of Projects Based on Categories")+theme_bw()
p4

Art, Games, Film & Video, Music, and Publishing boast the highest
number of successful projects.

p5<- ggplot(data, aes(x = log(backers), y = log(usd_goal_real), color=state)) +
  geom_jitter()  +
  ylab("USD Goal(log)") + xlab("Backers(log)") +
  ggtitle("USD Goal vs Backers")
p5

Both failed and successful Kickstarter projects exhibit a positive
correlation between the USD goal and the number of backers.

 p6<-ggplot(data, aes(log(usd_pledged_real),  fill = state)) +
  geom_boxplot() +
  xlab("USD pledged (log)") + ylab("") +
  ggtitle("USD pledged")+theme_bw()
p6

Successful projects on Kickstarter typically receive a higher average
amount of committed funds compared to unsuccessful ones.

Modelling our Data

Dividing the data into Train and Test

set.seed(10000)
data2 <- initial_split(data, prop = 0.60, strata = state)
trainset <- training(data2 )
testset <- testing(data2 )

We will create a training set and a test set randomly from the data
using the ‘initial split’ function. The proportion of the training set
utilized in the test set can be determined by specifying the proportion.
To maintain consistency in the proportion of each ‘state’ level, we will
incorporate the ‘Strata’ parameter in both the training and test
sets.

The Data Pre-Processing

Before proceeding, let’s preprocess our data by converting it into
dummy variables, as well as scaling and centering it

data_recipe<- recipe(state~., data = trainset)%>%
  step_dummy(all_nominal_predictors(), one_hot = T)%>%
  step_center(all_numeric_predictors())%>%
  step_scale(all_numeric_predictors())

In our case, we will specify the number of folds to be used during
training as 5 folds, to be repeated once.

cv_ks<- vfold_cv(trainset, v = 5, repeats = 1, strata = state)
measure <- metric_set(accuracy) 

Machine Learning Models.

We will fit two machine learning models: Logistic Regression and
Random Forest. These models are recognized for their superior
performance in classification tasks.

Logistic Regression Model

log_mod <- logistic_reg() %>% 
  set_engine("glm")%>%
  set_mode("classification")

logistic_model_fit  <- 
  workflow() %>% 
  add_model(log_mod) %>% 
  add_recipe(data_recipe) %>%
  fit(trainset)

Let’s evaluate the performance of logistic regression on the test
dataset.

logistic_model_fit    %>%
  augment(testset, type.predict = "response") %>%
  accuracy(state, .pred_class)
log_class<- predict(logistic_model_fit  , new_data = testset,
                       type = 'class')
prob_preds <- predict(logistic_model_fit  , new_data = testset, 
                      type = 'prob')
logistic_res<- testset %>% 
  select(state) %>% 
  bind_cols(log_class, prob_preds)

logistic_res%>% 
  # Plot ROC curve
  roc_curve(truth = state, .pred_failed) %>% 
  autoplot()

An accuracy of 99.8% on logistic regression? That’s incredible!

Random Forest Model

rf_mod<- 
  rand_forest(mtry = tune(),
              trees = tune(),
              min_n = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

rf_wf<- 
  workflow() %>% 
  add_recipe(data_recipe) %>% 
  add_model(rf_mod)

r_grid<- grid_random(
 mtry() %>% range_set(c( 1,  5)),
  trees() %>% range_set(c( 100, 150)), 
  min_n() %>% range_set(c(2,  8)),
  size = 10)

tune_random <- 
  rf_wf %>% 
  tune_grid(
    resamples = cv_ks, 
    grid = r_grid, 
    ##control = ctrl, 
    metrics = measure)
show_best(tune_random)

After parameter tuning, the random forest model with the highest
accuracy is the one with mtry=5, trees=117, and min_n=4.

rfo <- rf_wf%>%
  finalize_workflow(select_best(tune_random )) %>%
  fit(trainset)
rfo%>%
  augment(testset, type.predict = "response") %>%
  accuracy(state, .pred_class)
rf_class<- predict(rfo, new_data = testset,
                       type = 'class')
rf_prob <- predict(rfo , new_data = testset, 
                      type = 'prob')
rf_main <- testset %>% 
  select(state) %>% 
  bind_cols(rf_class, rf_prob)

rf_main%>% 
  # Plot ROC curve
  roc_curve(truth = state, .pred_failed) %>% 
  autoplot()

The accuracy for the random forest model was 99.6%

logistic_model_fit %>%
  tidy() %>%
  mutate(term =  ordered(term)|> fct_reorder(p.value)) %>%
  ggplot(aes(p.value, term, fill = p.value < 0.05)) +
  geom_col() +
  theme(legend.position = "none")

Categories like Publishing, Photography, Crafts, Art, Games, Design,
and Food demonstrate high success rates in securing funding on
Kickstarter.

Conclusion

In this project, Logistic Regression demonstrated superior
performance compared to the Random Forest model. The most funded
categories include Publishing, Photography, Crafts, Art, Games, Design,
and Food.

Leave a Reply

Your email address will not be published. Required fields are marked *