Database Reflection using dplyr

At work I write a ton of SQL, and I do most of my querying using R.  The workflow goes:

  1. Create a string with the SQL in R
  2. Plug the string into fetchQuery (see my previous post)

This solution works relatively well, but I’m a bit unhappy writing strings rather than composing function calls.

I began working on my own ORM implementation, but it was slow going and it would have taken a lot of time to get anywhere.  Luckily, I was pleasantly surprised to find that Hadley Wickham’s new dplyr package implements much of the ORM functionality I was hoping for.

One thing I want out of an ORM is the ability to see every table in our databases all at once.  That functionality, while implementable using the dplyr package, would likely take quite a while across tens of thousands of tables.  So I decided to implement the reflection myself.


#' Get the table information for a postgres database
#' @param config the configuration list
#' @return a data frame of the table names, columns, and column types of all columns in the database
getTableInformation <- function(config = config.gp) {
  fetchQuery(
    "SELECT table_name, column_name, data_type 
    FROM information_schema.columns 
    WHERE table_name NOT LIKE '%prt%' 
      AND table_name NOT LIKE '%ext%' 
      AND table_name NOT LIKE '%tmp%' 
    ORDER BY 1, 2",
    config
  )
}

#' Replacement for dplyr's internal update function; you don't need to call this.
update <- function(object, ...) {
  args <- list(...)
  for (nm in names(args)) {
    object[[nm]] <- args[[nm]]
  }
  # reflectDatabase pre-populates select with the reflected column names,
  # so we skip the per-table field query dplyr would normally make.
  if (is.null(object$select)) {
    if (is.ident(object$from)) {
      var_names <- object$select
    }
    else {
      var_names <- qry_fields(object$src$con, object$from)
    }
    vars <- lapply(var_names, as.name)
    object$select <- vars
  }
  object$query <- dplyr:::build_query(object)
  object
}
#' Function to reflect a database; generalizable to databases beyond Postgres
#' by changing getTableInformation appropriately
reflectDatabase <- function(config, envir.name = "tables",
                            subclass = "postgres") {
  if (!(envir.name %in% search())) {
    envir <- new.env(parent = .GlobalEnv)
  } else {
    envir <- as.environment(envir.name)
    detach(envir.name, character.only = TRUE)
  }
  src <- do.call(src_postgres, config)
  tables <- getTableInformation(config)
  tables <- split(tables, tables$table_name)
  lapply(tables, function(i) {
    nm <- ident(i$table_name[1])
    vars <- lapply(i$column_name, as.name)
    tbl <- dplyr::make_tbl(c(subclass, "sql"), src = src, from = nm,
                    select = vars, summarise = FALSE, mutate = FALSE,
                    where = NULL, group_by = NULL, order_by = NULL)
    tbl <- update(tbl)
    assign(
      nm,
      tbl,
      envir = envir
    )
  })
  attach(envir, name = envir.name)
}

searchTables <- function(str, env = "tables") {
  all.tbls <- ls(env)
  grep(str, all.tbls, value = TRUE)
}

To use this function, simply call (with your config defined as in my last post)

reflectDatabase(config)

and if you’re using a Postgres database, that should be it!

The fun part is that I can now do things like

res <- inner_join(my_table_1, my_table_2)

where my_table_1 and my_table_2 are simply names of tables in my database. This gives me auto-completion of table names, searchable table names and columns, etc.

For example:

searchTables('user')

returns all tables in our database with the string “user” in them.
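
Since each reflected object is an ordinary dplyr tbl, the usual verbs compose with them as well. A quick sketch (the column names here are hypothetical):

res <- inner_join(
  filter(my_table_1, status == "active"),
  select(my_table_2, user_id, country)
)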

These are some things I hope to see or find in dplyr, and may try to build myself if they don’t already exist:

  1. Case statements in mutate
  2. Creating table indexes
  3. Type checking of columns, and more informative error messages when nonsensical joins and filters are performed

Overall this package seems like a lot of fun, and I’m excited to work it into my coding!

Easier Database Querying with R

I have a strong distaste for database connection management.  All I want, when querying one of our many databases at work, is to supply the query and get the result back as an R data.frame or data.table.

R has many great database connection tools, including but not limited to RPostgreSQL, RMySQL, RJDBC, ROracle, and RODBC.  I set out to consolidate all of my database querying into a simple function, and I have succeeded for my purposes.  I can connect to my company’s MySQL, Postgres, and Oracle databases with ease using my fetchQuery function, defined (along with its dependencies) as follows.


#' Make a connection to a database
#' 
#' This function abstracts the idea of a database connection, allowing variable parameters 
#' depending on the type of database you're connecting to
#'@param config a named list of the configuration options for the database connection
#'@return a connection to the database defined in the config
#'@author Erik Gregory
makeCxn <- function(config) {
  # Accept either an initialized driver or a character string naming one
  if (is.character(config[['drv']])) {
    config[['drv']] <- dbDriver(config[['drv']])
  }
  do.call(dbConnect, config)
}

#' This function runs a query on a database, fetching the result if desired
#' 
#' The purpose of this function is to remove connection management from the querying process
#' @param query the query you want to make to the SQL connection you've specified
#' @param config a named list of the configuration options for the connection
#' @param n the number of rows to return, or -1 for all rows
#' @param verbose Should the queries be printed as they're made?
#' @param split Should the queries be split on semicolons, or run as a block?
#' @return A list of results if multiple queries, or a single result if one query.
#' @author Erik Gregory
fetchQuery <- function(query, config = config.gp, split = FALSE, verbose = TRUE, n = -1) {
  res <- list()
  cxn <- makeCxn(config)
  t1 <- Sys.time()
  queries <- query
  if (split == TRUE) {
    queries <- strsplit(query, ";", fixed = TRUE)[[1]] # Split the query into components
  }
  for (item in queries) {
    if (verbose) {
      cat(paste(item, '\n'))
    }
    tmp <- try(dbSendQuery(cxn, item)) # send the query
    if ('try-error' %in% class(tmp)) {
      res[[item]] <- dbGetException(cxn)
      next
    }
    # Crude detection of the statement type (select/with/explain return rows)
    type <- tolower(substring(gsub(" ", "", item), 1, 6))
    if (type == "select" | grepl("^with", type) | grepl("explai", type) | !split) {
      res[[item]] <- try(fetch(tmp, n))
    }
    else {
      res[[item]] <- dbGetRowsAffected(tmp)
      cat(res[[item]], "\n")
    }
    if (verbose) {
      print(Sys.time() - t1)
      if (!is.null(dim(res[[item]]))) {
        print(dim(res[[item]]))
      }
    }
    dbClearResult(tmp)
  }
  dbDisconnect(cxn)
  if (length(res) == 1) {
    res <- res[[1]]
  }
  res
}


I set the default config parameter of fetchQuery to my most commonly used connection. I define my connections in my ~/.Rprofile file, so the configuration information is always in memory whenever I need it. An example is as follows:

config.gp <- list(
    user = "username",
    password = "password",
    dbname = "MY_DBNAME",
    host = "url_of_host",
    port = port_of_host,
    drv = "PostgreSQL"
)

config.gp.admin <- list(
    user = "username_of_admin",
    password = "password_of_admin",
    dbname = "MY_DBNAME",
    host = "url_of_host",
    port = port_of_host,
    drv = "PostgreSQL"
)
config.mysql <- list(
    user = "username",
    password = "password",
    dbname = "MY_DBNAME",
    host = "MY_HOST_IP_ADDRESS",
    drv = "MySQL"
)
config.whse <- list(
  drv = JDBC("oracle.jdbc.OracleDriver", "/usr/lib/oracle/instantclient_11_2/ojdbc5.jar"),
  user = "username",
  password = "password",
  url =  "url"
)

Notice that the drv (driver) argument can be either an actual driver object or a character string naming the driver type. My reason for this is that some driver initializations require multiple parameters, while others require only one. This could be made more elegant with do.call, as in makeCxn above.
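
For instance, here is one hedged sketch of that idea (the makeDrv helper and its list convention are my own invention, not part of the code above): let the config carry either a driver object, a character string naming a driver, or a constructor-plus-arguments list, and build whatever it finds.

makeDrv <- function(drv.spec) {
  if (is.character(drv.spec)) {
    dbDriver(drv.spec)                   # e.g. "PostgreSQL" or "MySQL"
  } else if (is.list(drv.spec)) {
    # e.g. list(JDBC, "oracle.jdbc.OracleDriver", "/path/to/ojdbc5.jar")
    do.call(drv.spec[[1]], drv.spec[-1])
  } else {
    drv.spec                             # already an initialized driver
  }
}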

It is important that the connection lists defined in the ~/.Rprofile file

  1. Have the arguments you want to pass to dbConnect, named to match the corresponding dbConnect parameters
  2. Have no extra arguments that you don’t want passed to dbConnect


Some explanations behind the reasoning of the arguments and methods of fetchQuery:

  • Sometimes I want to run a bunch of queries all contained in one character string. This function will either split those queries on semicolons, or run them all in one batch, depending on what you ask it to do. The advantage of the former is that you get diagnostics for each of the intermediary queries (temporary table creations, table deletions or inserts, …); see the sketch after this list.
  • I usually want to see how my query is doing as it’s running, so I provide a verbosity option.
  • fetchQuery attempts to auto-detect whether you’re doing an insert, a deletion, or a selection, and returns a result appropriate to the operation. The detection is simple, crude string-matching at this point, and I’d be happy to see an improvement. It hasn’t been a problem for me yet since I am very consistent in my SQL syntax.
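
For instance (the table names here are hypothetical), a batched call with split = TRUE returns a named list with one entry per statement:

res <- fetchQuery(
  "CREATE TEMPORARY TABLE tmp_users AS SELECT * FROM users; SELECT count(*) FROM tmp_users",
  config.gp, split = TRUE
)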

So, whenever I want to run an Oracle query I’ll run something like:

res <- fetchQuery("SELECT * FROM table_name", config.whse)

or if I want to run a query as an admin against our Postgres database

res <- fetchQuery("SELECT * FROM table_name limit 10", config.gp.admin)

The connection gets closed whether or not the query errors, which is really nice for me and my company’s DBAs (the database drivers limit the number of active connections you can have open, so it is important to close them).
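
That said, the cleanup in fetchQuery relies on try() catching whatever goes wrong. If you want the disconnect guaranteed even for errors try() does not catch, one small hardening (my suggestion, not part of the code above) is on.exit:

fetchQuerySafe <- function(query, config = config.gp, n = -1) {
  cxn <- makeCxn(config)
  on.exit(dbDisconnect(cxn))  # runs even if the query throws an error
  tmp <- dbSendQuery(cxn, query)
  res <- fetch(tmp, n)
  dbClearResult(tmp)
  res
}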

A Sequence Clustering Model in R

I’ve just released my first R package!
Over the past 1.5 years or so, I’ve been studying an obscure statistical model over ranking (full, or partial) data called Mallows’ model.  It hypothesizes that a set of sequence data has a “modal” sequence about which the data cluster, and that the data fall away from that modal sequence with probability \exp{-\lambda}, where \lambda is derived from the data.

There is a closed-form solution for the \lambda parameter and the modal sequence, as outlined in the third reference in the package manual. However, we often have data that are heterogeneous, and it makes more sense to hypothesize multiple modal sequences. That is where this package comes in.

The RMallow package can take any set of sequence data of reasonable size (currently I have gotten it to process data sets with 2000+ rows and 20+ attributes in under a minute), and fit a Multi-Modal Mallows’ Model to the data using an EM algorithm, as described in the first reference in the package summary.

Certain conditions have to be met for the model to fit well. For example, if there are multiple modal sequences but they are very close together (as measured by a distance called Kendall’s metric, which counts the adjacent transpositions Bubble Sort would need to turn one ranking into the other), Mallows’ model may simply group them all into the same “cluster.” Here is an example where I fit the model to some crude synthetic data, and discover two constructed modal sequences. The sequences are no more than distance 20 apart (the maximum possible distance in this space is 20*(20-1)/2 = 190), and it still groups the data very well!
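
To make “Kendall’s metric” concrete, here is a small helper (my own illustration, not part of RMallow) that counts the discordant pairs between two rankings:

KendallDist <- function(a, b) {
  n <- length(a)
  d <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # A pair (i, j) is discordant if a and b order it differently
      d <- d + ((a[i] - a[j]) * (b[i] - b[j]) < 0)
    }
  }
  d
}
KendallDist(1:5, c(2, 1, 3, 4, 5))  # 1: one adjacent transposition apart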

The idea behind the constructed data below is to create two modal sequences for the code to find. I construct a good number of sequences identical to each modal one, and randomly transpose between 1 and 10 of the elements of each (uniformly, which I stress is NOT how Mallows’ model works…). I then add a good deal of noise in the form of partial rankings (rankings with no full ordering between certain groups). The final call fits the model.

# Some synthetic data to represent the capability of RMallow
library(RMallow)
modal <- 1:20
second.modal <- c(1, 2, 3, 4, 5, 17, 7, 9, 8, 11, 10, 
                 12, 13, 14, 15, 16, 6, 18, 19, 20)
partial.noise <- list()
modal.noise <- list()
second.noise <- list()
for (i in 1:300) {
  # Partial ranking noise
  partial.noise[[i]] <- sample(20, replace = TRUE)
  # Modal sequence 1
  modal.noise[[i]] <- modal
  # Modal sequence 2
  second.noise[[i]] <- second.modal
  # Choose to switch between 1 and 10 random indices
  swit <- sample(20, sample(10, 1))
  swit.2 <- sample(20, sample(10, 1))
  modal.noise[[i]][swit] <- modal.noise[[i]][rev(swit)]
  second.noise[[i]][swit.2] <- second.modal[rev(swit.2)]
}
partial.noise <- do.call("rbind", partial.noise)
partial.noise <- SimplifySequences(partial.noise)
modal.noise <- do.call("rbind", modal.noise)
second.noise <- do.call("rbind", second.noise)
datas <- do.call("rbind", list(modal.noise, partial.noise, second.noise))
datas <- datas[sample(nrow(datas)), ]
test <- Mallows(datas, iter = 100, G = 2)

Now for the moment of truth, to determine whether the code found the sequences. The output is a list, where the first element is the modal sequences found, second is the proportion of data assigned to each cluster, third is the cluster membership probabilities for each row, and fourth a data frame with the data, probabilities of cluster memberships, cluster assignments, and distance from each modal sequence.

> test[[1]]
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

[[2]]
 [1]  1  2  3  4  5 17  7  9  8 11 10 12 13 14 15 16  6 18 19 20

GREAT SUCCESS! I purposely did not set a seed here because, when I ran the code above, it actually did not produce what I was hoping for. I DID get that output when I ran a function I will put into the next release, which runs the algorithm several times and selects the best fit based on the log-likelihood of the data under the model. EM algorithms are prone to local maxima, so this is a necessary fix. You may very well get the output above when you run that code on your machine, depending on the state of your system. If not, try running it 5 times or so and it will most likely work out!

So there we have it. I have demonstrated that this package has some value in mining patterns from noisy sequence data. I will be working on implementing some solutions to increase the probability of convergence, and on extending the model to larger sequence spaces soon. I will also be providing a goodness-of-fit assessment for the model when the data satisfy certain conditions (sample size, …). Finally, I hope to demonstrate the value of this model on real-world data in my next post.

A spell-checker in R

I came across Dr. Peter Norvig’s blog about writing a basic spell-checker (http://norvig.com/spell-correct.html), and just had to try to implement it in R. Please excuse the ugly-ish code (I have not optimized it or commented it adequately at this point, but you can get the idea of what it does by reading Dr. Norvig’s blog). If anyone knows of any pre-built spell-checker packages in R, please let me know in a comment!

I do not think R is a particularly good language for this sort of activity, but I got it to work fine. The first few lines here create a table of common words and their frequencies in the English language (note that they call strip.text, which is defined in the next block, so define that first). They may take a few minutes to run on an average machine, but I will try to upload the result soon so that you can just download the table instead of creating it yourself…

words <- scan("http://norvig.com/big.txt", what = character())
words <- strip.text(words)
counts <- table(words)

Next, here are the functions we need to do the spell-check operations…

# This is a text processing function, which I
# borrowed from a CMU Data mining course professor.
strip.text <- function(txt) {
  # remove apostrophes (so "don't" -> "dont", "Jane's" -> "Janes", etc.)
  txt <- gsub("'","",txt)
  # convert to lowercase
  txt <- tolower(txt)
  # change other non-alphanumeric characters to spaces
  txt <- gsub("[^a-z0-9]"," ",txt)
  # remove digits
  txt <- gsub("[0-9]+"," ",txt)
  # split and make one vector
  txt <- unlist(strsplit(txt," "))
  # remove empty words
  txt <- txt[txt != ""]
  return(txt)
}

# Words within 1 transposition.
Transpositions <- function(word = FALSE) {
  N <- nchar(word)
  if (N > 2) {
    out <- rep(word, N - 1)
    word <- unlist(strsplit(word, NULL))
    # Permutations of the letters
    perms <- matrix(c(1:(N - 1), 2:N), ncol = 2)
    reversed <- perms[, 2:1]
    trans.words <- matrix(rep(word, N - 1), byrow = TRUE, nrow = N - 1)
    for(i in 1:(N - 1)) {
      trans.words[i, perms[i, ]] <- trans.words[i, reversed[i, ]]
      out[i] <- paste(trans.words[i, ], collapse = "")
    }
  }
  else if (N == 2) {
    # Split into characters before swapping them
    word <- unlist(strsplit(word, NULL))
    out <- paste(word[2:1], collapse = "")
  }
  else {
    out <- paste(word, collapse = "")
  }
  return(out)
}

# Single letter deletions.
# Thanks to luiscarlosmr for partial correction in comments
Deletes <- function(word = FALSE) {
  N <- nchar(word)
  out <- character(N)
  word <- unlist(strsplit(word, NULL))
  for(i in 1:N) {
    out[i] <- paste(word[-i], collapse = "")
  }
  return(out)
}

# Single-letter insertions.
Insertions <- function(word = FALSE) {
  N <- nchar(word) 
  out <- list()
  for (letter in letters) {
    out[[letter]] <- rep(word, N + 1)
    for (i in 1:(N + 1)) {
      out[[letter]][i] <- paste(substr(word, i - N, i - 1), letter, 
                                substr(word, i, N), sep = "")
    }
  }
  out <- unlist(out)
  return(out)
}

# Single-letter replacements.
Replaces <- function(word = FALSE) {
  N <- nchar(word) 
  out <- list()
  for (letter in letters) {
    out[[letter]] <- rep(word, N)
    for (i in 1:N) {
      out[[letter]][i] <- paste(substr(word, i - N, i - 1), letter, 
                                substr(word, i + 1, N + 1), sep = "")
    }
  }
  out <- unlist(out)
  return(out)
}
# All Neighbors with distance "1"
Neighbors <- function(word) {
  neighbors <- c(word, Replaces(word), Deletes(word),
                 Insertions(word), Transpositions(word))
  return(neighbors)
}

# Probability as determined by our corpus.
Probability <- function(word, dtm) {
  # Total number of words in the corpus
  N <- sum(dtm)
  word.number <- which(names(dtm) == word)
  count <- dtm[word.number]
  pval <- count/N
  return(pval)
}

# Correct a single word.
Correct <- function(word, dtm) {
  neighbors <- Neighbors(word)
  # If it is a word, just return it.
  if (word %in% names(dtm)) {
    out <- word
  }
  # Otherwise, check for neighbors.
  else {
    # Which of the neighbors are known words?
    known <- which(neighbors %in% names(dtm))
    N.known <- length(known)
    # If there are no known neighbors, including the word,
    # look farther away.
    if (N.known == 0) {
      print(paste("Having a hard time matching '", word, "'...", sep = ""))
      neighbors <- unlist(lapply(neighbors, Neighbors))
    }
    # Then filter out non-words.
    neighbors <- neighbors[which(neighbors %in% names(dtm))]
    N <- length(neighbors)
    # If we found some neighbors, find the one with the highest
    # p-value.
    if (N >= 1) {
      P <- 0*(1:N)
      for (i in 1:N) {
        P[i] <- Probability(neighbors[i], dtm)
      }
      out <- neighbors[which.max(P)]
    }
    # If no neighbors still, return the word.
    else {
      out <- word
    }
  }
  return(out)
}

# Correct an entire document.
CorrectDocument <- function(document, dtm) {
  by.word <- unlist(strsplit(document, " "))
  N <- length(by.word)
  for (i in 1:N) {
    by.word[i] <- Correct(by.word[i], dtm = dtm)
  }
  corrected <- paste(by.word, collapse = " ")
  return(corrected)
}

The above functions generate “neighbors” of words, determine probabilities of the neighbors, and return the best ones. Function “CorrectDocument” will correct an entire document (with special characters and punctuation removed), and “Correct” will simply correct a word. Here are some sample runs.

> Correct("speling", dtm = counts)
        l4 
"spelling" 
> Correct("korrecter", dtm = counts)
[1] "Having a hard time matching 'korrecter'..."
      c1.d9 
"corrected" 
> CorrectDocument("the quick bruwn fowx jumpt ovre tha lasy dog", dtm = counts)
[1] "the quick brown fox jump over the last dog"

As you can see, this function is obviously not perfect. It does some basic corrections automatically, but there are improvements to be made. More to come!

Decoding a Substitution Cipher using Simulated Annealing

My last post discussed a method to decode a substitution cipher using a Metropolis-Hastings algorithm. It was brought to my attention that this code could be improved by using Simulated Annealing methods to jump around the sample space and avoid some of the local maxima. Here is a basic description of the difference:

In a basic Metropolis-Hastings algorithm, it is relatively unlikely that we will jump to states of the system which worsen our objective function (note: we usually try to decrease a cost function, but in the code below our goal is to maximize a plausibility instead). The algorithm greedily tries to attain a maximum value of the objective, and thus is very prone to hitting local maxima instead of the global one.

In contrast, when we use simulated annealing methods we are much more likely to jump to “worse” states during early iterations. This is because in the early stages we want to jump around the sample space and try to find where the global maximum will occur. We also build a “cooling” schedule into the algorithm: our likelihood of jumping to “worse” states slowly decreases as the algorithm iterates.

In a standard Metropolis-Hastings algorithm, we jump to state s^* with probability \frac{P(s^*)}{P(s)} when P(s^*) < P(s). In my annealing method to decode a cipher, I first fix a temperature T, and move to state s^* with probability \frac{P(s^*)}{P(s)} \frac{TN}{Ti - 2i + N} when P(s^*) < P(s), where N is the number of iterations and i is the current iteration number. The reasoning behind this acceptance probability is that at first we are basically multiplying the acceptance probability by T, and by the end we are basically not changing it at all from \frac{P(s^*)}{P(s)}. Try plugging in 0 and N for i to verify this. The result is the following transition probability function, which can perform both the annealing and Metropolis-Hastings probability calculations.

# Function to calculate the transition probability.
Prob = function(pl.fstar, pl.f, temp, i, iter, method = "Anneal") {
  if(method == "Anneal") {
    flip = (pl.fstar/pl.f)*temp/((temp - 1)*(i/iter) + (iter-i)/iter)
  }
  else{
    flip = pl.fstar/pl.f
  }
  return(flip)
}

An additional variable that I added was a method of changing neighbors. In very large sample spaces, it can take a lot of transpositions to move from one state to another. In fact, the farthest state from any given state is \frac{n(n-1)}{2} transpositions away, where n is the length of the permutation. In our case, that is 1326 moves. The chance that the previous algorithm would get stuck in a local maximum en route is very high. We combat this by allowing the neighbors of any particular permutation to be farther away. Consider the following neighbor-generating function.

# Function to generate a new candidate function.
# Inputs:
#  f: the function we want to find a neighbor of.
#  N: the domain size of f
#  top: the maximum number of elements to transpose.
Neighbor = function(f, N = length(f), top = 2) {
  # Decide how many elements to switch (between 2 and top)
  if(top == 2) {
    num = 2
  }
  else {
    num = sample(2:top, 1)
  }
  # Which elements to switch?
  swit = sample(1:N, num)
  
  # Decide how to transpose our elements
  if(num == 2) {
    ord = 2:1
  }
  else {
    ord = sample(1:num)
  }
  # Prevent the case where we do not
  # switch any elements.
  while(all(ord == 1:num)) {
    ord = sample(1:num)
  }
  # fstar will be a neighbor of f
  fstar = f
  fstar[swit] = f[swit[ord]]
  return(fstar)
}

We supply the function with the maximum number of elements, top, it is allowed to transpose, and it decides to transpose between 2 and top elements, at random (uniformly). It then decides on a random ordering of that transposition (there are a lot of ways to move around 10 different elements, for example). The function has a check to ensure that it does not simply output the same function as was input.

Another idea I came up with to let the algorithm zero in on a good solution was to make it start doing single transpositions eventually. This allows us to fine-tune our “good” solution in the later iterations. In the code, the “attack” parameter determines how quickly we begin this fine-tuning: small values of “attack” correspond to a lot of jumping around throughout the iterations, while high values correspond to fine-tuning earlier on.
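
To see what that schedule looks like, here is a quick illustration (my own aside) of the “top” formula used inside DecodeMessage below, taking N = 52 as in the 1326 figure above:

iter <- 100000
N <- 52
i <- c(1, 50000, 90000, 100000)
round(2 + (N - 2) * ((iter - i)/iter)^1)   # attack = 1:  52 27  7  2
round(2 + (N - 2) * ((iter - i)/iter)^10)  # attack = 10: 52  2  2  2

The result is our final decoding function.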

# Generate the plausibility of fstar
new.pl <- function(fstar, converted, len, trans) {
  test.mat = fstar[converted]
  compare = matrix(c(test.mat[1:(len -1)], test.mat[2:len]), ncol = 2)
  pl.fstar = prod(trans[compare])
  return(pl.fstar)
}
# Inputs:
#  trans: a matrix of transition probabilities
#  dictionary: list of possible characters
#  message: the secret message
#  iter: number of iterations
#  temp: the initial temperature of the system
#  attack: parameter to determine how quickly we
#          approach single-transposition versus
#          multiple-transpositions.  High values 
#          of "attack" will cause us to rapidly approach
#          single-transpositions.
DecodeMessage = function(trans, dictionary, message, iter = 1000,
                       temp = 10, method = "Anneal",
                       attack = 1) {
  # Size of the dictionary
  N = length(dictionary)
  # Number of characters in the message.
  len = length(message) 
  # Convert the message to the corresponding numeric values in 
  # the dictionary.
  converted = 0*(1:len)
  for(i in 1:len) {
    converted[i] = which(dictionary == message[i])
  }
  # Take a random guess at the permutation:
  # f will be 'permanent', fstar will be a guess.
  f = sample(N, N)
  fstar = f
  best.guess = f
  # Keep track of the plausibility of the best guess,
  # and of the current state of the system.
  pl.f = 0*(1:iter)
  pl.f.max = 0
  # There is no reason not to generate all of the 
  # random numbers at once, since they are independent.
  test.val = runif(iter, 0, 1)
  for(i in 1:iter) {
    pl.fstar = new.pl(fstar, converted, len, trans)
    # flip is the probability that we switch to 
    # fstar.  Note that this function will output
    # values greater than 1, in which case we will always
    # switch.
    flip = Prob(pl.fstar, pl.f[i], temp, i, iter, method) 
    if(flip >= test.val[i]) {
      f = fstar
      pl.f[i] = pl.fstar
    }
    # Keep track of the best over-all guess.
    if(pl.f[i] > pl.f.max) {
      best.guess = f
      pl.f.max = pl.f[i]
    }
    # Periodically output the current state
    # of the decoding candidates.
    if(i %% 10000 == 0) {
      cat(dictionary[f[converted]][1:44])
      cat("\n")
    }
    # Generate the "top" input for the Neighbor function.
    top = round(2 + (N - 2)*((iter - i)/iter)^attack)
    # Generate a new candidate function,
    # and initiate next pl.f value.
    fstar = Neighbor(f, N, top)
    pl.f[i + 1] = pl.f[i]
  }
  return(list(current = dictionary[f[converted]], 
              pl.f = pl.f, 
              best.guess = dictionary[best.guess[converted]], 
              best.pl = pl.f.max))
}

Here we read in the data, and format the transition matrix (to prevent zero and infinite values of the plausibility function).

# Read in the matrix of transitions
trans = as.matrix(read.csv("http://www2.research.att.com/~volinsky/DataMining/Columbia2011/HW/KennyMCMC/wp_transition_counts.csv",
                   header = FALSE))
# Vector of all possible characters
dictionary = scan("http://www2.research.att.com/~volinsky/DataMining/Columbia2011/HW/KennyMCMC/wp_char_vector_sort.txt",
             what = character(), sep = "\n")
# The secret message
message = scan("http://www2.research.att.com/~volinsky/DataMining/Columbia2011/HW/KennyMCMC/message1.txt", 
             what = character())
# Separate the message into individual characters
message = unlist(strsplit(do.call(paste, as.list(message)), NULL))
# We smooth the transition counts so that there are 
# no zero values (a sort of Laplacian smoothing), and 
# rescale so that the product of many probabilities does 
# not underflow to 0 (a limit of R's computational precision)
fixed.trans1 = log(trans + 1)
fixed.trans1 = 50*(fixed.trans1 + 1)/(rowSums(fixed.trans1) + 51)

# Set a seed to make this reproducible...
set.seed(8675309)
# Set a different, random seed, so as to avoid some bias
# (your random number will still be the same as mine here)
set.seed(runif(1, 0, 999999))

Here are a couple of sample calls of the function. One uses the Annealing method with the varying neighbor-generating function; the other uses the standard Metropolis-Hastings algorithm with the same neighbor-generating function. I semi-arbitrarily chose “attack” to be 10 here, and temp to be 10000.

DecodeMessage(fixed.trans1, dictionary, message, 
            iter = 100000, temp = 10000,
            method = "Anneal", attack = 10) -> annel
u o t e s c o s p t c i t - o h e s t i s t c l o t n e d   c t i n t n o - d ' a d y t
z a t e s c a s p t c o t - a k e s t o s t c h a t u e l   c t o u t u a - l y i l b t
w a t e n c a n p t c o t g a d e n t o n t c h a t f e r   c t o f t f a g r u i r y t
r e   i n t e n d   t o   k e v i n   o n   t h e   w i s p t   o w   w e k s u a s f  
w e   i n c e n d   c o   r e g i n   o n   c h e   m i s t c   o m   m e r s u a s y  
w e   i n c e n d   c o   b e g i n   o n   c h e   f i s t c   o f   f e b s u a s y  
w e   i n t e n d   t o   b e g i n   o n   t h e   f i l s t   o f   f e b l u a l y  
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  
w e   i n t e n d   t o   r e g i n   o n   t h e   v i s f t   o v   v e r s u a s y  
w e   i n t e n d   t o   p e g i n   o n   t h e   f i l s t   o f   f e p l u a l z  
DecodeMessage(fixed.trans1, dictionary, message, 
              iter = 100000, method = "Metrop",
              attack = 10) -> metrop
p i   a n s i n v   s o   c i g a n   o n   s h i   f a d m s   o f   f i c d u e d w  
k e   i n c e n f   c o   y e g i n   o n   c h e   w i s t c   o w   w e y s u a s "  
k e   i n c e n d   c o   w e g i n   o n   c h e   f i s t c   o f   f e w s u a s y  
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  
w e   i n t e n d   t o   b e g i n   o n   t h e   f i l s t   o f   f e b l u a l y  
w e   i n t e n d   t o   y e g i n   o n   t h e   f i r s t   o f   f e y r u a r v  
w e   i n t e n g   t o   p e d i n   o n   t h e   v i r f t   o v   v e p r u a r k  
w e   i n t e n v   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  
w e   i n t e n d   t o   z e k i n   o n   t h e   f i l s t   o f   f e z l u a l y  
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  

We get the following graph of the plausibility for each method: the black is the Annealing method, the red is the Metropolis method.

Note that the Annealing method drops down more often, but the two eventually converge to similar values. Here are the final decodings of the message by the different methods:

> cat(annel$best.guess, fill = 60)
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s 
t   o f   f e b r u a r y   u n r e s t r i c t e d   s u b 
m a r i n e   w a r f a r e !   w e   s h a l l   e n d e a 
k o r   i n   s p i t e   o f   t h i s   t o   v e e p   t 
h e   u n i t e d   s t a t e s   o f   a m e r i c a   n e 
u t r a l !   i n   t h e   e k e n t   o f   t h i s   n o 
t   s u c c e e d i n g ,   w e   m a v e   m e x i c o   a 
  p r o p o s a l   o f   a l l i a n c e   o n   t h e   f 
o l l o w i n g   b a s i s ?   m a v e   w a r   t o g e t 
h e r ,   m a v e   p e a c e   t o g e t h e r ,   g e n e 
r o u s   f i n a n c i a l   s u p p o r t   a n d   a n   
u n d e r s t a n d i n g   o n   o u r   p a r t   t h a t 
  m e x i c o   i s   t o   r e c o n q u e r   t h e   l o 
s t   t e r r i t o r y   i n   t e x a s ,   n e w   m e x 
i c o ,   a n d   a r i z o n a !   t h e   s e t t l e m e 
n t   i n   d e t a i l   i s   l e f t   t o   y o u !   y 
o u   w i l l   i n f o r m   t h e   p r e s i d e n t   o 
f   t h e   a b o k e   m o s t   s e c r e t l y   a s   s 
o n   a s   t h e   o u t b r e a v   o f   w a r   w i t h 
  t h e   u n i t e d   s t a t e s   o f   a m e r i c a   
i s   c e r t a i n   a n d   a d d   t h e   s u g g e s t 
i o n   t h a t   h e   s h o u l d ,   o n   h i s   o w n 
  i n i t i a t i k e ,   i n k i t e   j a p a n   t o   i 
m m e d i a t e   a d h e r e n c e   a n d   a t   t h e   
s a m e   t i m e   m e d i a t e   b e t w e e n   j a p a 
n   a n d   o u r s e l k e s !   p l e a s e   c a l l   t 
h e   p r e s i d e n t . s   a t t e n t i o n   t o   t h 
e   f a c t   t h a t   t h e   r u t h l e s s   e m p l o 
y m e n t   o f   o u r   s u b m a r i n e s   n o w   o f 
f e r s   t h e   p r o s p e c t   o f   c o m p e l l i n 
g   e n g l a n d   i n   a   f e w   m o n t h s   t o   m 
a v e   p e a c e !
> cat(metrop$best.guess, fill = 60)
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s 
t   o f   f e b r u a r y   u n r e s t r i c t e d   s u b 
m a r i n e   w a r f a r e .   w e   s h a l l   e n d e a 
v o r   i n   s p i t e   o f   t h i s   t o   k e e p   t 
h e   u n i t e d   s t a t e s   o f   a m e r i c a   n e 
u t r a l .   i n   t h e   e v e n t   o f   t h i s   n o 
t   s u c c e e d i n g ,   w e   m a k e   m e x i c o   a 
  p r o p o s a l   o f   a l l i a n c e   o n   t h e   f 
o l l o w i n g   b a s i s :   m a k e   w a r   t o g e t 
h e r ,   m a k e   p e a c e   t o g e t h e r ,   g e n e 
r o u s   f i n a n c i a l   s u p p o r t   a n d   a n   
u n d e r s t a n d i n g   o n   o u r   p a r t   t h a t 
  m e x i c o   i s   t o   r e c o n j u e r   t h e   l o 
s t   t e r r i t o r y   i n   t e x a s ,   n e w   m e x 
i c o ,   a n d   a r i z o n a .   t h e   s e t t l e m e 
n t   i n   d e t a i l   i s   l e f t   t o   y o u .   y 
o u   w i l l   i n f o r m   t h e   p r e s i d e n t   o 
f   t h e   a b o v e   m o s t   s e c r e t l y   a s   s 
o n   a s   t h e   o u t b r e a k   o f   w a r   w i t h 
  t h e   u n i t e d   s t a t e s   o f   a m e r i c a   
i s   c e r t a i n   a n d   a d d   t h e   s u g g e s t 
i o n   t h a t   h e   s h o u l d ,   o n   h i s   o w n 
  i n i t i a t i v e ,   i n v i t e   " a p a n   t o   i 
m m e d i a t e   a d h e r e n c e   a n d   a t   t h e   
s a m e   t i m e   m e d i a t e   b e t w e e n   " a p a 
n   a n d   o u r s e l v e s .   p l e a s e   c a l l   t 
h e   p r e s i d e n t ' s   a t t e n t i o n   t o   t h 
e   f a c t   t h a t   t h e   r u t h l e s s   e m p l o 
y m e n t   o f   o u r   s u b m a r i n e s   n o w   o f 
f e r s   t h e   p r o s p e c t   o f   c o m p e l l i n 
g   e n g l a n d   i n   a   f e w   m o n t h s   t o   m 
a k e   p e a c e .

In this case, the Metropolis method seems to have found the better decoding. But that is not necessarily the point. The goal of the Annealing method is to avoid local maxima, and it was successful in that pursuit (as evidenced by the readability of the message).

I have had varying results with both methods, depending on parameter inputs. It is not clear which of the two is better, and I do not have the computing power to benchmark them. In any event, there is probably some fine-tuning of the parameters that could improve the efficacy of the decoding methods. Both do seem to come close to the correct message more often than the code in my last post.

This post is linked to on http://www.R-bloggers.com

Decoding A Simple Cipher

My friend introduced me to a web site for a data mining course at Columbia, and I have been pretty hooked on some of the projects. One of the most recent ones I’ve worked on was decoding a simple cipher using a Metropolis-Hastings algorithm. It is a pretty amazing little technique: it jumps around a sample space of 51! possible substitutions (something like 10^66 possible choices) and yet quickly converges to a hopefully decoded version of the message. As a quick sanity check on that count (a small aside of mine):
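
lfactorial(51) / log(10)  # log10(51!) is about 66.2, so 51! is roughly 10^66

I finally got it to work, and the code runs in around 15 seconds. Check it out.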

# Function to decode a message of symbols or characters, given
# a matrix of transitions, a dictionary, and a secret message
Decode = function(trans, dictionary, message, iter = 10,
                   thresh = 3) {
  # Size of the dictionary
  N = length(dictionary)
  # Number of characters in the message.
  len = length(message) 
  # Convert the message to the corresponding numeric values in 
  # the dictionary.
  converted = 0*(1:len)
  for(i in 1:len) {
    converted[i] = which(dictionary == message[i])
  }
  # Take a random guess at the permutation:
  # f will be 'permanent', fstar will be a guess.
  f = sample(N, N)
  fstar = f
  pl.f = 0
  # Keep track of the plausibility of the best guess,
  # and of each individual accepted guess. 
  k = 1
  rem = 0
  top.plaus = 0
  stopped = 0
  for(i in 1:iter) {
    # The current test of the decoded message.
    test.mat = fstar[converted]
    # This is a len(message) x 2 matrix of indices, used to index the 
    # matrix containing the transition probabilities/frequencies.
    compare = matrix(c(test.mat[1:(len -1)], test.mat[2:len]), ncol = 2)
    # The plausibility of fstar is the product of the probabilities 
    # in the transition matrix.
    pl.fstar = prod(trans[compare])
    # If the plausibility of fstar is better than f, change f
    # to fstar.
    if(pl.fstar > pl.f) {
      f = fstar
      pl.f = pl.fstar
      # Record the new plausibility.
      rem[k] = pl.f
      k = k + 1
    }
    # If the plausibility did NOT increase, 
    # switch to the new guess with probability
    # pl.fstar/pl.f
    else {
      flip = pl.fstar/(pl.f + 1E-99)
      if (runif(1, 0, 1) < flip) {
        f = fstar
        rem[k] = pl.fstar
        k = k + 1
      }
    }
    # Periodically output the best effort
    # at decoding the message.
    if(i %% 10000 == 0) {
      pl.change = pl.f - top.plaus
      if(pl.change == 0) {
        stopped = stopped + 1
        if (stopped == thresh) {
          cat("Chain Converged \n")
          break
        }
      }
      else {
        stopped = 0
        cat(dictionary[f[converted]][1:44])
        cat("\n")
        cat(pl.f)
        cat("\n")
      }
      top.plaus = pl.f
    }
    # After each iteration, randomly transpose
    # two entries of our best guess and assign
    # the result to fstar.
    swit = sample(1:N, 2)
    fstar = f
    fstar[swit] = f[rev(swit)]
  }
  return(list(dictionary[f[converted]],
              rem))
}

It makes sense to eventually add some capability to test whether the algorithm is mixing properly (is the plausibility of f still changing?) and to start the chain at different points, since this algorithm is not guaranteed to converge. It may reach a local maximum, and never get to jump to the final solution.
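
Since Decode starts from a fresh random permutation on every call, a crude guard against local maxima is to run it several times and keep the run whose recorded plausibility ends highest. A minimal sketch (my addition, not from the original post; the DecodeBest name is invented):

DecodeBest <- function(trans, dictionary, message, starts = 5, iter = 100000) {
  best <- NULL
  best.pl <- -Inf
  for (s in 1:starts) {
    out <- Decode(trans, dictionary, message, iter = iter)
    pl <- tail(out[[2]], 1)  # last plausibility this run recorded
    if (pl > best.pl) {
      best.pl <- pl
      best <- out
    }
  }
  best
}

The following code scans in the data, and formats the transition probabilities using log counts and a little bit of Laplacian smoothing.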

# Read in the matrix of transitions
trans = as.matrix(read.csv("http://www2.research.att.com/~volinsky/DataMining/Columbia2011/HW/KennyMCMC/wp_transition_counts.csv",
                   header = FALSE))
# Vector of all possible characters
dictionary = scan("http://www2.research.att.com/~volinsky/DataMining/Columbia2011/HW/KennyMCMC/wp_char_vector_sort.txt",
             what = character(), sep = "\n")
# The secret message
message = scan("http://www2.research.att.com/~volinsky/DataMining/Columbia2011/HW/KennyMCMC/message1.txt", 
             what = character())
# Separate the message into individual characters
message = unlist(strsplit(do.call(paste, as.list(message)), NULL))
fixed.trans1 = log(trans + 1)
fixed.trans1 = 50*(fixed.trans1 + 1)/(rowSums(fixed.trans1) + 51)

And here is a sample call of the function, with output. It prints out the best decoding every 10,000 iterations, and will stop trying if the message doesn’t change at all for 30,000 iterations.

> cat(message)
, ' r k . t ' . s r t   r n ' ( k . r   . r t * ' r ; k ! d t r   ; r ; ' n ! 6 c ! 2 r 6 . ! ' d t ! k o t ' s r d 6 n 1 c ! k . ' r , c ! ; c ! ' 9 r , ' r d * c 3 3 r ' . s ' c h   ! r k . r d e k t ' r   ; r t * k d r t   r z ' ' e r t * ' r 6 . k t ' s r d t c t ' d r   ; r c 1 ' ! k o c r . ' 6 t ! c 3 9 r k . r t * ' r ' h ' . t r   ; r t * k d r .   t r d 6 o o ' ' s k . ( 8 r , ' r 1 c z ' r 1 ' / k o   r c r e !   e   d c 3 r   ; r c 3 3 k c . o ' r   . r t * ' r ;   3 3   , k . ( r n c d k d l r 1 c z ' r , c ! r t   ( ' t * ' ! 8 r 1 c z ' r e ' c o ' r t   ( ' t * ' ! 8 r ( ' . ' !   6 d r ; k . c . o k c 3 r d 6 e e   ! t r c . s r c . r 6 . s ' ! d t c . s k . ( r   . r   6 ! r e c ! t r t * c t r 1 ' / k o   r k d r t   r ! ' o   . m 6 ' ! r t * ' r 3   d t r t ' ! ! k t   ! 2 r k . r t ' / c d 8 r . ' , r 1 ' / k o   8 r c . s r c ! k w   . c 9 r t * ' r d ' t t 3 ' 1 ' . t r k . r s ' t c k 3 r k d r 3 ' ; t r t   r 2   6 9 r 2   6 r , k 3 3 r k . ;   ! 1 r t * ' r e ! ' d k s ' . t r   ; r t * ' r c n   h ' r 1   d t r d ' o ! ' t 3 2 r c d r d   . r c d r t * ' r   6 t n ! ' c z r   ; r , c ! r , k t * r t * ' r 6 . k t ' s r d t c t ' d r   ; r c 1 ' ! k o c r k d r o ' ! t c k . r c . s r c s s r t * ' r d 6 ( ( ' d t k   . r t * c t r * ' r d *   6 3 s 8 r   . r * k d r   , . r k . k t k c t k h ' 8 r k . h k t ' r ) c e c . r t   r k 1 1 ' s k c t ' r c s * ' ! ' . o ' r c . s r c t r t * ' r d c 1 ' r t k 1 ' r 1 ' s k c t ' r n ' t , ' ' . r ) c e c . r c . s r   6 ! d ' 3 h ' d 9 r e 3 ' c d ' r o c 3 3 r t * ' r e ! ' d k s ' . t b d r c t t ' . t k   . r t   r t * ' r ; c o t r t * c t r t * ' r ! 6 t * 3 ' d d r ' 1 e 3   2 1 ' . t r   ; r   6 ! r d 6 n 1 c ! k . ' d r .   , r   ; ; ' ! d r t * ' r e !   d e ' o t r   ; r o   1 e ' 3 3 k . ( r ' . ( 3 c . s r k . r c r ; ' , r 1   . t * d r t   r 1 c z ' r e ' c o ' 9
> Decode(fixed.trans1, dictionary, message, iter = 100000) -> test2
w e   i n t e n g   t o   m e k i n   o n   t h e   f i r s t   o f   f e m r u a r y  
3.082826e+263
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  
2.657652e+267
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  
7.711922e+267
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  
8.556139e+268
Chain Converged 

It doesn’t always work like you would hope, but here is the original message, followed by our decoded version. Note that in the final version, the algorithm decided that “d” and “g” should be switched.

> cat(message, fill = 88)
, ' r k . t ' . s r t   r n ' ( k . r   . r t * ' r ; k ! d t r   ; r ; ' n ! 6 c ! 2 r 
6 . ! ' d t ! k o t ' s r d 6 n 1 c ! k . ' r , c ! ; c ! ' 9 r , ' r d * c 3 3 r ' . s 
' c h   ! r k . r d e k t ' r   ; r t * k d r t   r z ' ' e r t * ' r 6 . k t ' s r d t 
c t ' d r   ; r c 1 ' ! k o c r . ' 6 t ! c 3 9 r k . r t * ' r ' h ' . t r   ; r t * k 
d r .   t r d 6 o o ' ' s k . ( 8 r , ' r 1 c z ' r 1 ' / k o   r c r e !   e   d c 3 r 
  ; r c 3 3 k c . o ' r   . r t * ' r ;   3 3   , k . ( r n c d k d l r 1 c z ' r , c ! 
r t   ( ' t * ' ! 8 r 1 c z ' r e ' c o ' r t   ( ' t * ' ! 8 r ( ' . ' !   6 d r ; k . 
c . o k c 3 r d 6 e e   ! t r c . s r c . r 6 . s ' ! d t c . s k . ( r   . r   6 ! r e 
c ! t r t * c t r 1 ' / k o   r k d r t   r ! ' o   . m 6 ' ! r t * ' r 3   d t r t ' ! 
! k t   ! 2 r k . r t ' / c d 8 r . ' , r 1 ' / k o   8 r c . s r c ! k w   . c 9 r t * 
' r d ' t t 3 ' 1 ' . t r k . r s ' t c k 3 r k d r 3 ' ; t r t   r 2   6 9 r 2   6 r , 
k 3 3 r k . ;   ! 1 r t * ' r e ! ' d k s ' . t r   ; r t * ' r c n   h ' r 1   d t r d 
' o ! ' t 3 2 r c d r d   . r c d r t * ' r   6 t n ! ' c z r   ; r , c ! r , k t * r t 
* ' r 6 . k t ' s r d t c t ' d r   ; r c 1 ' ! k o c r k d r o ' ! t c k . r c . s r c 
s s r t * ' r d 6 ( ( ' d t k   . r t * c t r * ' r d *   6 3 s 8 r   . r * k d r   , . 
r k . k t k c t k h ' 8 r k . h k t ' r ) c e c . r t   r k 1 1 ' s k c t ' r c s * ' ! 
' . o ' r c . s r c t r t * ' r d c 1 ' r t k 1 ' r 1 ' s k c t ' r n ' t , ' ' . r ) c 
e c . r c . s r   6 ! d ' 3 h ' d 9 r e 3 ' c d ' r o c 3 3 r t * ' r e ! ' d k s ' . t 
b d r c t t ' . t k   . r t   r t * ' r ; c o t r t * c t r t * ' r ! 6 t * 3 ' d d r ' 
1 e 3   2 1 ' . t r   ; r   6 ! r d 6 n 1 c ! k . ' d r .   , r   ; ; ' ! d r t * ' r e 
!   d e ' o t r   ; r o   1 e ' 3 3 k . ( r ' . ( 3 c . s r k . r c r ; ' , r 1   . t * 
d r t   r 1 c z ' r e ' c o ' 9
> cat(test2[[1]], fill = 88)
w e   i n t e n g   t o   b e d i n   o n   t h e   f i r s t   o f   f e b r u a r y   
u n r e s t r i c t e g   s u b m a r i n e   w a r f a r e ,   w e   s h a l l   e n g 
e a v o r   i n   s p i t e   o f   t h i s   t o   k e e p   t h e   u n i t e g   s t 
a t e s   o f   a m e r i c a   n e u t r a l ,   i n   t h e   e v e n t   o f   t h i 
s   n o t   s u c c e e g i n d .   w e   m a k e   m e x i c o   a   p r o p o s a l   
o f   a l l i a n c e   o n   t h e   f o l l o w i n d   b a s i s q   m a k e   w a r 
  t o d e t h e r .   m a k e   p e a c e   t o d e t h e r .   d e n e r o u s   f i n 
a n c i a l   s u p p o r t   a n g   a n   u n g e r s t a n g i n d   o n   o u r   p 
a r t   t h a t   m e x i c o   i s   t o   r e c o n j u e r   t h e   l o s t   t e r 
r i t o r y   i n   t e x a s .   n e w   m e x i c o .   a n g   a r i ! o n a ,   t h 
e   s e t t l e m e n t   i n   g e t a i l   i s   l e f t   t o   y o u ,   y o u   w 
i l l   i n f o r m   t h e   p r e s i g e n t   o f   t h e   a b o v e   m o s t   s 
e c r e t l y   a s   s o n   a s   t h e   o u t b r e a k   o f   w a r   w i t h   t 
h e   u n i t e g   s t a t e s   o f   a m e r i c a   i s   c e r t a i n   a n g   a 
g g   t h e   s u d d e s t i o n   t h a t   h e   s h o u l g .   o n   h i s   o w n 
  i n i t i a t i v e .   i n v i t e   ( a p a n   t o   i m m e g i a t e   a g h e r 
e n c e   a n g   a t   t h e   s a m e   t i m e   m e g i a t e   b e t w e e n   ( a 
p a n   a n g   o u r s e l v e s ,   p l e a s e   c a l l   t h e   p r e s i g e n t 
" s   a t t e n t i o n   t o   t h e   f a c t   t h a t   t h e   r u t h l e s s   e 
m p l o y m e n t   o f   o u r   s u b m a r i n e s   n o w   o f f e r s   t h e   p 
r o s p e c t   o f   c o m p e l l i n d   e n d l a n g   i n   a   f e w   m o n t h 
s   t o   m a k e   p e a c e ,

A little Googling will tell us that this is the “Zimmermann Telegram.”

Although this is a simple type of cipher that could be broken quite easily with basic frequency analysis of the characters, the method is an interesting one.

References:
Diaconis, Persi. “The Markov Chain Monte Carlo Revolution.”