Hautaulogy
My Support Vector Machine Will Go On
09/30/13

The final project for a machine learning course I recently took entailed a Kaggle competition, the goal of which was to predict who survived the sinking of the Titanic using historical passenger data.

The project, my first Kaggle contest, was pretty fun! In this post, I'll briefly cover my most successful approach so far, using a Support Vector Machine model for prediction. The entire codebase for the project can be found in this GitHub repo.

The bulk of the work consisted of signal detection and feature selection on the training set, along with lots of iteration to see what yielded better predictions. In my data prep, I converted string variables such as "Sex" and "Cabin" to boolean values. Next, I wrote a function to plot signal for the target variable in these new features, plotting chi-squared test statistics and p-values, as well as Pearson correlation.

    # Try to get at mutual information between target and logical variables
    signal.metrics <- function(data, columns) {
      ret.list <- list()
      for (x in columns) {
        if (any(data[,x] > 0)) {
          ret.list[[x]] <- list()
          ret.list[[x]][["chi.sq"]] <- chisq.test(data[,x], data$Survived)
          ret.list[[x]][["cor"]] <- cor(data[,x], data$Survived)
          # Confusion matrix of feature vs. outcome
          ret.list[[x]][["conf.mat"]] <- matrix(
            c(nrow(data[which(data[,x] == T & data$Survived == 1),]),
              nrow(data[which(data[,x] == F & data$Survived == 1),]),
              nrow(data[which(data[,x] == T & data$Survived == 0),]),
              nrow(data[which(data[,x] == F & data$Survived == 0),])
            ), nrow=2, ncol=2
          )
        }
      }
      return(ret.list)
    }

    plot.signal <- function(signal) {

      chi.sq.stats <- sapply(signal, function(x) x[["chi.sq"]]$statistic)
      p.values <- sapply(signal, function(x) x[["chi.sq"]]$p.value)
      cor.values <- sapply(signal, function(x) x[["cor"]])

      # Scale the statistics and p-values so they share one y-axis
      chi.sq.results <- scale(data.frame(p.values, chi.sq.stats))

      plot(chi.sq.results[,"chi.sq.stats"], type="b", xaxt="n", ylab="")
      axis(1, at=1:nrow(chi.sq.results), labels=names(signal))
      lines(chi.sq.results[,"p.values"], type='l', col="red")
      text(seq_along(p.values), chi.sq.results[,"p.values"],
           labels=format(p.values, digits=2), col="purple")
      lines(cor.values, type='l', col="orange")
      text(seq_along(cor.values), cor.values,
           labels=format(cor.values, digits=2), col="blue")

      legend(x=2, y=2,
        c("Chi-Squared Statistics", "P Values", "Correlation"),
        lty=c(1,1,1),
        col=c("black", "red", "orange"))
    }


Based on Pearson correlation and statistical validity, the two features with the most powerful signal were "Sex" and the "Fare" variable, the price paid for a ticket. Simply eyeballing the "Fare" distributions on a box plot makes this apparent.

Simply put, being poor and male on the Titanic's final voyage, well...
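That box-plot check takes one line; a sketch, assuming the training set has been read into a data frame named `train`:

    # Compare fare distributions by outcome. Assumes the Kaggle training
    # set is loaded, e.g. train <- read.csv("train.csv").
    fare.by.outcome <- split(train$Fare, train$Survived)
    sapply(fare.by.outcome, median)  # survivors' median fare is much higher

    boxplot(Fare ~ Survived, data = train,
            names = c("Perished", "Survived"), ylab = "Fare")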

Along with a few other categorical features with weaker (but productive) signal, my first pass at prediction on the test set came in shy of the benchmark at 0.77512.

Further gains came from imputing the "Age" variable, which was missing in a substantial number of observations. My imputation strategy used two nominal variables as a proxy, predicting "Age" from their associated age distributions. The imputed "Age" feature didn't help prediction by itself, but binning it into age groups yielded a predictive "Young" feature. Creating more features by binning "Fare" yielded another boost, which got me past the benchmark at 0.78469. Below are the functions I wrote to train and test the SVM model:

    # Requires the e1071 package for svm() and classAgreement()
    library(e1071)

    titanic.svm.model <- function(data, variables) {

      target <- data$Survived
      model.train <- subset(data, select=c(variables, "Survived"))
      model.train[,variables] <- scale(model.train[,variables])

      # Regress Survived on all of the selected (scaled) features
      model <- svm(Survived ~ ., data=model.train)
      prediction.data <- subset(model.train, select=variables)
      pred <- predict(model, prediction.data)
      tab <- table(pred=round(pred), true=as.factor(target))

      return(list(model=model, conf.matrix=tab,
                  performance=classAgreement(tab), predictions=pred))
    }

    test.survival.model <- function(test.model, variables, test.data, model.type="svm") {

      test.data.scaled <- scale(subset(test.data, select=c(variables)))
      # Fill remaining NAs in the second feature (Age) with its median
      test.data.scaled[which(is.na(test.data.scaled[,2])),2] <-
        median(test.data.scaled[,2], na.rm=T)
      if (model.type == "randomforest") {
        test.pred <- sapply(predict(test.model, test.data.scaled), function(x) {
          return((x - 0.5) > 0)
        })
      } else {
        test.pred <- predict(test.model, test.data.scaled)
      }

      write.csv(cbind(PassengerId=test.data$PassengerId, Survived=round(test.pred)),
        paste("titanic_submission_", format(Sys.time(), "%m_%d_%y_%X.csv"), sep=""),
        row.names=F)
    }
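
For completeness, the "Young" and fare-bin features can be derived roughly like this. The cut points below are illustrative stand-ins, not necessarily the ones behind the 0.78469 score:

    # Illustrative binning for the "Young" and fare-bin features.
    # Cut points are examples; tune them against validation accuracy.
    add.bins <- function(data) {
      data$Young <- as.numeric(!is.na(data$Age) & data$Age < 16)
      fare.cuts <- c(0, 10, 20, 30, Inf)
      for (i in 1:4) {
        data[[paste0("Fare", i)]] <-
          as.numeric(data$Fare >= fare.cuts[i] & data$Fare < fare.cuts[i + 1])
      }
      return(data)
    }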
  

Looking forward, I think I'll find further improvement by trying ensembling techniques, mixing other predictive models in with the SVM.
