The Snipper 3 app suite

Building your custom offline Snipper


We also provide a template script that lets you classify your data without an Internet connection, should you ever need to. You can also modify this script to add further functionality at will. To use it, you will need to be familiar with R, the statistical programming environment, and with installing external packages in R.

We make no attempt to explain what goes on behind the scenes when you classify using a Naïve Bayes algorithm, other than saying that each test set individual is assigned to the population it is most likely to belong to. If you want to know more about Naïve Bayes, a wealth of information is already available elsewhere (for instance, Wikipedia and machinelearningplus). In fact, the script we provide is based on the one available at machinelearningplus. We will not chew the fat; we will lead you straight to your target!
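As a toy illustration of the idea (not part of the Snipper pipeline), the sketch below applies Bayes' rule by hand for a single marker and two populations. The genotype frequencies and population names are made up for the example:

```r
# Toy Naive Bayes step with hypothetical numbers: one marker whose genotype
# "AA" occurs with different frequency in two populations.
freq_AA <- c(PopA = 0.60, PopB = 0.10)  # P(genotype | population), assumed
prior   <- c(PopA = 0.50, PopB = 0.50)  # equal prior probabilities

# Bayes' rule: posterior is proportional to likelihood times prior
unnormalised <- freq_AA * prior
posterior <- unnormalised / sum(unnormalised)
posterior                    # PopA = 6/7 ~ 0.857, PopB = 1/7 ~ 0.143
names(which.max(posterior))  # the individual is assigned to "PopA"
```

With several markers, the per-marker likelihoods are simply multiplied together before normalising, which is the "naive" independence assumption.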

First, download and install R for your operating system. You will also need to install the external packages klaR, openxlsx, caret and ggplot2, which can be done in several ways. To install them from the command line, type something like

install.packages("foo", dependencies=TRUE)
for each of the packages. You can also install them all at the same time by typing
install.packages(c("klaR","openxlsx", "caret", "ggplot2"), dependencies=TRUE)

Alternatively, your R environment (e.g. RStudio) may let you install these packages through a point-and-click interface. If so, just follow the instructions it provides.

Package tcltk normally ships with R as an optional component. However, on some systems you may have to install it manually.
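A quick way to check whether your R build has Tcl/Tk support at all is the base R `capabilities()` function:

```r
# Sanity check (base R): was this R build compiled with Tcl/Tk support?
# TRUE means library(tcltk) and tk_choose.files() should work.
capabilities("tcltk")
```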

At this point you should have prepared your training set and test set as Excel files formatted like these examples (1, 2, 3). Then run the following R script to obtain a classification. You will get both console output and a graph: frequencies for each marker in each population, likelihoods of each test set individual belonging to each population, a classification for each test set individual, and a graph summarising all results. Be aware that misclassifications can occur when populations do not separate well enough.
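If you cannot open the example files, it may help to see the shape of data frame that read.xlsx() is expected to return: one Population column plus one column per marker, one row per individual, with genotypes stored as text and missing data coded "NN" (or left blank) in the spreadsheet. The marker names and genotypes below are purely illustrative:

```r
# Hypothetical sketch of the expected layout (marker names and genotypes
# are made up; your spreadsheet defines the real ones).
training <- data.frame(
  Population = c("PopA", "PopA", "PopB", "PopB"),
  D3S1358    = c("15/16", "15/15", "17/18", "16/17"),
  vWA        = c("14/16", "14/14", "17/17", "16/17")
)
str(training)  # all columns are converted to factors before NaiveBayes()
```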

# you must install packages klaR, openxlsx, caret, ggplot2 and tcltk
library(klaR)
library(openxlsx)
library(tcltk)

#read data (add more NA strings, like "N/N", if you use them)
training <- read.xlsx(tk_choose.files(caption='Choose training set file'), sheet=1, na.strings=c("NN", ""))
test <- read.xlsx(tk_choose.files(caption='Choose test set file'), sheet=1, na.strings=c("NN", ""))
#if you prefer to hard-code your own file names, skip tcltk and rewrite the two lines above as
#training <- read.xlsx("training_set.xlsx", sheet=1, na.strings=c("NN", ""))
#test <- read.xlsx("test_set.xlsx", sheet=1, na.strings=c("NN", ""))

training[] <- lapply(training, factor) #convert training columns into factors
test[] <- lapply(test, factor) #convert test columns into factors
nb_mod <- NaiveBayes(Population ~ ., data=training) #train the classifier
pred <- suppressWarnings(predict(nb_mod, test)) #classify the test set
nb_mod #print marker frequencies per population
pred #print classifications and posterior probabilities
tab <- table(pred$class, test$Population)
caret::confusionMatrix(tab)
library(ggplot2)
test$pred <- pred$class
ggplot(test, aes(Population, pred, color = Population)) +
  geom_jitter(width = 0.2, height = 0.1, size = 2) +
  labs(title="Confusion Matrix", subtitle="Predicted vs. Observed from test set", y="Predicted", x="Truth")
# more examples of useful code
# nrow(training) # gives you number of individuals in training set
# ncol(training)-1 # gives you number of markers in training set (the Population column does not count)
# levels(training[,'Population']) # gives you populations in training set
# nrow(test) # gives you number of individuals in test set
# ncol(test)-1 # gives you number of markers in test set
# levels(test[,'Population']) # gives you populations in test set
# length(nb_mod$varnames) will also give you number of markers in training set
# nb_mod$apriori['PopA'] # gives you the ratio of number of PopA individuals over number of all training set individuals
# nb_mod$varnames  # gives you array of all marker names
# nb_mod$varnames[2] # gives you name of second marker
# nb_mod$tables$vWA  # gives you a table of frequencies of all alleles of marker vWA in each population
# nb_mod$tables$vWA['PopA',] # gives you frequencies of all alleles of marker vWA in PopA
# z<-nb_mod$tables$vWA['PopA',][which(nb_mod$tables$vWA['PopA',]>0)] # stores in z nonnull frequencies of all alleles of marker vWA in PopA
# Thus -sum(z*log(z,base=2)) would be the entropy of marker vWA in PopA
# See docs for an example of Jensen-Shannon divergence
# test$Population[5]  # gives you real population of fifth individual in test set
# pred$class[5]  # gives you predicted population of fifth individual in test set
# pred$posterior[5,] # gives you posterior probabilities of fifth individual in test set for all populations
# sum(pred$posterior[5,])  # these posterior probabilities always sum to one
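The entropy and divergence comments above can be turned into a few lines of runnable R. The frequency vectors below are made up for illustration; in practice you would take them from rows of nb_mod$tables, e.g. nb_mod$tables$vWA['PopA',], as the comments show:

```r
# Hedged sketch: entropy and Jensen-Shannon divergence for one marker,
# using hypothetical allele-frequency vectors for two populations.
entropy <- function(p) {
  p <- p[p > 0]                # drop null frequencies, as with z above
  -sum(p * log(p, base = 2))
}
pA <- c(0.5, 0.3, 0.2)         # made-up marker frequencies in PopA
pB <- c(0.2, 0.3, 0.5)         # made-up marker frequencies in PopB
m  <- (pA + pB) / 2            # mixture of the two distributions

# JSD(P,Q) = H((P+Q)/2) - (H(P)+H(Q))/2, in bits
jsd <- entropy(m) - (entropy(pA) + entropy(pB)) / 2
jsd                            # 0 means identical distributions, 1 means disjoint
```

The higher the divergence of a marker between two populations, the more that marker contributes to separating them.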

Some examples

You can download an example of SNP classification, an example of haplotype classification, and an example of STR classification. For clarity, we also provide a copy of the output you should get when running each example. Just set the decompressed folder as your working directory in R, then copy and paste the provided R script into your R console.

You can also download an example of MLR (multinomial logistic regression) classification.

Bonus file: an R script to calculate divergence for the STR example, which you can tailor to your needs.