Hautaulogy
YouTube JSON API Objective-C Client, Without The Pain
04/17/14

Google/YouTube provide a great JSON API for a variety of operations on the platform's data objects, e.g., videos and playlists. It's a really awesome resource that makes YouTube a valuable asset for a number of possible applications. However, taking advantage of the API was a major pain in a recent iOS project, and unnecessarily so! There doesn't seem to be much out there about how to properly set up the API for iOS projects, hence this post!

While Google provides a number of client libraries for the JSON API, I was surprised by how difficult the Objective-C client was to install. I never figured out how to "properly" add the client library to my Xcode project, given the confusing and seemingly outdated documentation on the topic.

Anybody wanting to install the YouTube library for iOS should simply use the iOS-GTLYouTube CocoaPod project on GitHub. brynbellomy figured out this mess, and now that the library is available as a CocoaPod, why would anyone do otherwise?!
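For reference, here's a minimal Podfile sketch of the CocoaPods route. I'm assuming the pod is published under the same name as the GitHub repo; adjust to whatever `pod search` turns up for you:

```ruby
# Podfile (sketch; pod name assumed from the iOS-GTLYouTube GitHub project)
platform :ios, '7.0'

pod 'iOS-GTLYouTube'
```

Then run `pod install` and open the generated `.xcworkspace` rather than your `.xcodeproj`.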

As of early April 2014, after you have the client library set up, make sure NOT to follow Google's instructions to include your project's bundle ID in your app settings on the Google API Console. Otherwise, the API will return a bad response. The issue is being tracked on Google's issue tracker (which is kind of an eyesore) and will hopefully be resolved soon. In the meantime, don't let anybody get your access token, or somebody might get a free ride of API requests on your back!

Broken Arrow
10/22/13

I recently started a Rails project that involves web page scraping to acquire data. The scraping process uses the Nokogiri gem to fetch HTML as Nokogiri::HTML::Document objects, which can be manipulated and stored in my database. So far, so good!

But it wasn't good. Certain numerical data depended on the presence of multibyte characters in the document text, specifically "↑" and "↓", which indicate whether an integer is positive or negative.

  
    string = "↓1"

    # Turn tiny arrows into operators so I can parse strings as integers!

    # plus
    string.gsub!("↑","+")

    # minus
    string.gsub!("↓","-")

    # "↓1" should become "-1"
    integer = string.to_i

  

However, initializing the application produced this error:

  
    21:59:05 resque.1 | rake aborted!
    21:59:05 resque.1 | .../ruby-1.9.3-p448@global/gems/rake-10.1.0/lib/rake/trace_output.rb:16:in `block in trace_on': invalid byte sequence in US-ASCII (ArgumentError)
  

In the words of Don LaFontaine:

  
  ["↑","↓"].include?(@enemy)
  

There were dependencies in the codebase that couldn't handle non-ASCII characters. Thus began my short and unexpected adventure into character encoding in Ruby, a brief chapter of the ol' Pickaxe (Chapter 17, "Character Encoding," is unfortunately not available in the free web extracts).
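For context, here's a minimal stdlib-only sketch (my own, not from the Pickaxe) of how Ruby 1.9+ tags every string with an encoding, and why reinterpreting multibyte bytes as US-ASCII blows up:

```ruby
# Every Ruby string carries its own encoding (Ruby 1.9+).
s = "\u2193" + "1"      # "↓1"
puts s.encoding         # UTF-8
puts s.bytesize         # 4 -- the arrow alone is 3 bytes in UTF-8
puts s.ascii_only?      # false

# Reinterpreting those same bytes as US-ASCII yields an invalid string,
# which is the root of "invalid byte sequence in US-ASCII" errors.
t = s.dup.force_encoding("US-ASCII")
puts t.valid_encoding?  # false
```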

Ruby allows non-ASCII characters in a source file if you specify a different encoding via a "magic comment" at the top of the script, like so:

  
    # encoding: ISO-8859-1
    puts "Olé!"
  

I eventually grew impatient sifting through alternative encodings that would work for "↑" and "↓". Instead of using a "magic comment", I decided to see if I could parse the string as an HTML entity.

One of the joys of programming in Ruby is that there is a plethora of open source libraries available to tackle confounding problems such as these. The HTMLEntities gem offers a powerful and convenient Swiss Army Knife to parse HTML entities as you like.

My predicament with "↑" and "↓" was over. Phew!

  
    string = "↓1"
    # Encode tiny arrows as HTML entities if they're present.
    coder = HTMLEntities.new
    string = coder.encode(string, :hexadecimal)
  
    # Turn the encoded tiny arrows into operators so I can parse strings as integers!

    # plus ("↑" encodes to &#x2191;)
    string.gsub!("&#x2191;","+")

    # minus ("↓" encodes to &#x2193;)
    string.gsub!("&#x2193;","-")

    # "↓1" should become "-1" and not break anything!
    integer = string.to_i
  
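If you'd rather skip the gem dependency, the same trick can be sketched with the standard library alone, hex-encoding non-ASCII characters by hand (`arrow_to_i` is a hypothetical helper name, not something from my codebase):

```ruby
# Stdlib-only sketch: hex-encode any non-ASCII character the way
# HTMLEntities#encode(string, :hexadecimal) would, then map the
# encoded arrows to sign operators before parsing the integer.
def arrow_to_i(string)
  encoded = string.gsub(/[^\x00-\x7F]/) { |ch| format("&#x%x;", ch.ord) }
  encoded.gsub("&#x2191;", "+")  # "↑" is U+2191
         .gsub("&#x2193;", "-")  # "↓" is U+2193
         .to_i
end

arrow_to_i("\u21931")  # => -1
```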

My Support Vector Machine Will Go On
09/30/13

The final project for a machine learning course I recently took entailed a Kaggle competition, the goal of which was to predict who survived the sinking of the Titanic using historical passenger data.

The project, my first Kaggle contest, was pretty fun! In this post, I'll briefly cover my most successful approach so far, using a Support Vector Machine model for prediction. The entire codebase for the project can be found in this GitHub repo.

The brunt of the work consisted of signal detection and feature selection on the training set, along with lots of iteration to see what yielded better predictions. In my data prep, I converted string variables such as "Sex" and "Cabin" to boolean values. Next, I wrote a function to plot signal for the target variable in these new features, plotting chi-squared test statistics and p-values, as well as Pearson correlations.

  
    # Try to get at mutual information between target and logical variables
    
    signal.metrics <- function(data, columns) {
      ret.list <- list()
      for (x in columns) {
        # Skip all-FALSE columns, which would break chisq.test
        if (any(data[,x])) {
          ret.list[[x]] <- list()
          ret.list[[x]][["chi.sq"]] <- chisq.test(data[,x], data$Survived)
          ret.list[[x]][["cor"]] <- cor(data[,x], data$Survived)
          # Confusion matrix
          ret.list[[x]][["conf.mat"]] <- matrix(
            c(nrow(data[which(data[,x] == T & data$Survived == 1),]),
              nrow(data[which(data[,x] == F & data$Survived == 1),]),
              nrow(data[which(data[,x] == T & data$Survived == 0),]),
              nrow(data[which(data[,x] == F & data$Survived == 0),])
            ), nrow=2, ncol=2
          )
        }
      }
      return(ret.list)
    }

    plot.signal <- function(signal) {

      chi.sq.scores <- sapply(signal, function(x) {
        return(x[["chi.sq"]]$statistic)
      })

      p.values <- sapply(signal, function(x) {
        return(x[["chi.sq"]]$p.value)
      })

      chi.sq.results <- scale(data.frame(p.values, chi.sq.scores))
      cor.values <- sapply(signal, function(x) {
        return(x[["cor"]])
      })

      plot(chi.sq.results[,"chi.sq.scores"], type="b", xaxt="n", ylab="")
      axis(1, at=1:nrow(chi.sq.results), labels=names(signal))
      lines(chi.sq.results[,"p.values"], type='l', col="red")
      text(seq_along(p.values), chi.sq.results[,"p.values"],
        labels=format(p.values, digits=2), col="purple")
      lines(cor.values, type='l', col="orange")
      text(seq_along(cor.values), cor.values,
        labels=format(cor.values, digits=2), col="blue")

      legend(x=2, y=2,
        c("Chi-Squared Statistics", "P Values", "Correlation"),
        lty=c(1,1,1),
        col=c("black","red","orange"))
    }
  

Based on Pearson correlation and statistical validity, the two features with the most powerful signal were "Sex" and the ordinal "Fare" variable signifying the price paid for a ticket. Simply eyeballing the "Fare" distributions on a box plot makes this apparent.

Simply put, being poor and male on the Titanic's final voyage, well...

Along with a few other categorical features with weaker (but productive) signal, my first pass at prediction on the test set came in shy of the benchmark at 0.77512.

Further gains would come from imputation of the "Age" variable, which was absent in a substantial number of observations. My strategy for imputation used two nominal variables as a proxy, predicting "Age" by using their associated age distributions. This imputed "Age" feature didn't help prediction itself, but binning for age groups yielded a predictive "Young" feature. Creating more features by binning "Fare" yielded another boost, which got me past the benchmark at 0.78469. Below are the functions I wrote to train and test the SVM model:

  
    library(e1071) # provides svm() and classAgreement()

    titanic.svm.model <- function(data, variables) {
    
      target <- data$Survived
      model.train <- subset(data, select=c(variables,"Survived"))
      model.train[,variables] <- scale(model.train[,variables])
      
      # model.train holds only the selected variables plus the target,
      # so regress on everything rather than a hard-coded formula
      model <- svm(Survived ~ ., data=model.train)
      prediction.data <- subset(model.train, select=variables)
      pred <- predict(model, prediction.data)
      tab <- table(pred=round(pred),true=as.factor(target))

      return(list(model=model,conf.matrix=tab,performance=classAgreement(tab),predictions=pred))
    }
    
    test.survival.model <- function (test.model,variables,test.data,model.type="svm") {
    
      test.data.scaled <- scale(subset(test.data, select =c(variables)))
      # Impute missing values in the second column ("Age" here) with its median
      test.data.scaled[which(is.na(test.data.scaled[,2])),2] <- median(test.data.scaled[,2],na.rm=T)
      if (model.type == "randomforest") {
        test.pred <- sapply(predict(test.model, test.data.scaled),function(x){
          return((x - 0.5) > 0)   
        })
      } else {
        test.pred <- predict(test.model, test.data.scaled)  
      }
        
      write.csv(cbind(PassengerId=test.data$PassengerId,Survived=round(test.pred)),
        paste("titanic_submission_",format(Sys.time(), "%m_%d_%y_%X.csv"),sep=""),
        row.names=F)
    }
  

Looking forward, I think I'll find further improvement by trying ensembling techniques, mixing in other predictive models along with SVM.