Category Archives: Uncategorized

Database Reflection using dplyr

At work I write a ton of SQL, and I do most of my querying using R.  The workflow goes:

  1. Create a string with the SQL in R
  2. Plug the string into fetchQuery (see my previous post)

This solution works relatively well, but i’m a bit unhappy writing strings rather than using function calls.

I began working on my own ORM implementation, but it was very slow-go and it would have taken a lot of time to get anywhere.  Luckily, I was pleasantly surprised that Hadley Wickham’s new dplyr package implements much of the ORM I was hoping for.

One thing I want out of an ORM is the ability to see every table in our databases all at once.  That functionality, while implement-able using the dplyr package, would likely take quite a while on tens of thousands of tables.  So I decided to implement the reflection myself.

 

 

#' Get the table information for a postgres database
#' @param config the configuration list
#' @return the table names, columns, and column types of all columns in the database
getTableInformation <- function(config = config.gp) {
  tables <- fetchQuery(
    "SELECT table_name, column_name, data_type 
    FROM information_schema.columns 
    WHERE table_name NOT LIKE '%prt%' 
      AND table_name NOT LIKE '%ext%' 
      AND table_name NOT LIKE '%tmp%' 
    ORDER BY 1, 2",
    config
  )
}

#' Replacement of the normal update function, you don't need to call this.
update <- function(object, ...) {
    args <- list(...)
    for (nm in names(args)) {
        object[[nm]] <- args[[nm]]
    }
    if (is.null(object$select)) {
        if (is.ident(object$from)) {
            var_names <- object$select
        }
        else {
            var_names <- qry_fields(object$src$con, object$from)
        }
        vars <- lapply(var_names, as.name)
        object$select <- vars
    }
    object$query <- dplyr:::build_query(object)
    object
}
#' Function to reflect a database, generalizable to others beyond postgres 
#' by simply changing getTableInformation appropriately
reflectDatabase <- function(config, envir.name = "tables",
                            subclass = "postgres") {
  if (!(envir.name %in% search())) {
    envir <- new.env(parent = .GlobalEnv)
  } else {
    envir <- as.environment(envir.name)
    detach(envir.name, character.only = TRUE)
  }
  src <- do.call(src_postgres, config)
  tables <- getTableInformation(config)
  tables <- split(tables, tables$table_name)
  lapply(tables, function(i) {
    nm <- ident(i$table_name[1])
    vars <- lapply(i$column_name, as.name)
    tbl <- dplyr::make_tbl(c(subclass, "sql"), src = src, from = nm,
                    select = vars, summarise = FALSE, mutate = FALSE,
                    where = NULL, group_by = NULL, order_by = NULL)
    tbl <- update(tbl)
    assign(
      nm,
      tbl,
      envir = envir
    )
  })
  attach(envir, name = envir.name)
}

searchTables <- function(str, env = "tables") {
  all.tbls <- ls(env)
  all.tbls[grep(str, all.tbls)]
}

To use this function, you can simply call (with your config, as specified in my last post, included)

reflectDatabase(config)

and if you’re using a Postgres database, that should be it!

The fun part now, is that I can do things like

res <- inner_join(my_table_1, my_table_2)

where my_table_1 and my_table_2 are simply names of tables in my database. This provides me with auto-complete of table names, search-able table names and columns, etc.

For example:

searchTables('user')

returns all tables in our database with the string “user” in them.

These are some things I hope to see or find in dplyr, and may try to build myself if they don’t already exist:
1. Case statements in mutate
2. Creating table indexes
3. type checking of columns, and more informative error messages when un-sensible joins and filters are performed.

Overall this package seems like a lot of fun, and i’m excited to try to work it into my coding!

Advertisements

Decoding A Simple Cipher

My friend introduced me to a web site for a data mining course at Columbia, and I have been pretty hooked on some of the projects. One of the most recent ones I’ve worked on was decoding a simple cipher, using a Metropolis-Hastings algorithm. It is a pretty amazing little technique, which jumps around the sample space of 51! possible substitutions (something like 10^66 possible choices) and yet still quickly converges to a hopefully decoded version of the message. I finally got it to work, and the code runs in around 15 seconds. Check it out.

# Function to decode a message of symbols or characters, given
# A matrix of transitions, a dictionary, and a secret message
Decode = function(trans, dictionary, message, iter = 10,
                   thresh = 3) {
  # Size of the dictionary
  N = length(dictionary)
  # Number of characters in the message.
  len = length(message) 
  # Convert the matrix to the cooresponding numeric value in 
  # the dictionary.
  converted = 0*(1:len)
  for(i in 1:len) {
    converted[i] = which(dictionary == message[i])
  }
  # Take a random guess at the permutation:
  # f will be 'permanent', fstar will be a guess.
  f = sample(N, N)
  fstar = f
  pl.f = 0
  # Keep track of the plausability of the best guess,
  # and of each individual guess. 
  k = 1
  rem = 0
  rem.plfstar = 0*(1:iter)
  top.plaus = 0
  stopped = 0
  for(i in 1:iter) {
    # The current test of the decoded message.
    test.mat = fstar[converted]
    # This is a len(message)x 2 matrix of indecies, to index the 
    # matrix containing the transision probabilities/frequencies.
    compare = matrix(c(test.mat[1:(len -1)], test.mat[2:len]), ncol = 2)
    # The plausability of fstar is the product of the probabilities 
    # in the transition matrix.
    pl.fstar = prod(trans[compare])
    # If the plausibility of fstar is better than f, change f
    # to fstar.
    if(pl.fstar > pl.f) {
      f = fstar
      pl.f = pl.fstar
      # Record the new plausibility.
      rem[k] = pl.f
      k = k + 1
    }
    # If the plausibility did NOT increase, 
    # change to the new guess with p value
    # pl.fstar/pl.f
    else {
      flip = pl.fstar/(pl.f + 1E-99)
      if (runif(1, 0, 1) < flip) {
        f = fstar
        rem[k] = pl.fstar
        k = k + 1
      }
    }
    # Periodically output the best effort
    # at decoding the message.
    if(i %% 10000 == 0) {
      pl.change = pl.f - top.plaus
      if(pl.change == 0) {
        stopped = stopped + 1
        if (stopped == thresh) {
          cat("Chain Converged \n")
          break
        }
      }
      else {
        stopped = 0
        cat(dictionary[f[converted]][1:44])
        cat("\n")
        cat(pl.f)
        cat("\n")
      }
      top.plaus = pl.f
    }
    # After each iterations, randomly transpose
    # two entries of our best guess and assign them
    # to fstar.
    swit = sample(1:N, 2)
    fstar = f
    fstar[swit] = f[rev(swit)]
  }
  return(list(dictionary[f[converted]],
              rem))
}

It makes sense to eventually add some capability to test whether the algorithm is jumping around properly (is the probability of f changing?) and to start the chain at different points, since this algorithm is not guaranteed to converge. It may reach a local maximum, and never get to jump to the final solution. The following code scans in the data, and formats the transition probabilities using the log p-value and a little bit of Laplacian smoothing.

# Read in the matrix of transitions
trans = as.matrix(read.csv("http://www2.research.att.com/~volinsky/DataMining/Columbia2011/HW/KennyMCMC/wp_transition_counts.csv",
                   header = FALSE))
# Vector of all possible characters
dictionary = scan("http://www2.research.att.com/~volinsky/DataMining/Columbia2011/HW/KennyMCMC/wp_char_vector_sort.txt",
             what = character(), sep = "\n")
# The secret message
message = scan("http://www2.research.att.com/~volinsky/DataMining/Columbia2011/HW/KennyMCMC/message1.txt", 
             what = character())
# Separate the message into individual characters
message = unlist(strsplit(do.call(paste, as.list(message)), NULL))
fixed.trans1 = log(trans + 1)
fixed.trans1 = 50*(fixed.trans1 + 1)/(rowSums(fixed.trans1) + 51)

And here is a sample call of the function, with output. It prints out the best decoding every 10,000 iterations, and will stop trying if the message doesn’t change at all for 30,000 iterations.

> cat(message)
, ' r k . t ' . s r t   r n ' ( k . r   . r t * ' r ; k ! d t r   ; r ; ' n ! 6 c ! 2 r 6 . ! ' d t ! k o t ' s r d 6 n 1 c ! k . ' r , c ! ; c ! ' 9 r , ' r d * c 3 3 r ' . s ' c h   ! r k . r d e k t ' r   ; r t * k d r t   r z ' ' e r t * ' r 6 . k t ' s r d t c t ' d r   ; r c 1 ' ! k o c r . ' 6 t ! c 3 9 r k . r t * ' r ' h ' . t r   ; r t * k d r .   t r d 6 o o ' ' s k . ( 8 r , ' r 1 c z ' r 1 ' / k o   r c r e !   e   d c 3 r   ; r c 3 3 k c . o ' r   . r t * ' r ;   3 3   , k . ( r n c d k d l r 1 c z ' r , c ! r t   ( ' t * ' ! 8 r 1 c z ' r e ' c o ' r t   ( ' t * ' ! 8 r ( ' . ' !   6 d r ; k . c . o k c 3 r d 6 e e   ! t r c . s r c . r 6 . s ' ! d t c . s k . ( r   . r   6 ! r e c ! t r t * c t r 1 ' / k o   r k d r t   r ! ' o   . m 6 ' ! r t * ' r 3   d t r t ' ! ! k t   ! 2 r k . r t ' / c d 8 r . ' , r 1 ' / k o   8 r c . s r c ! k w   . c 9 r t * ' r d ' t t 3 ' 1 ' . t r k . r s ' t c k 3 r k d r 3 ' ; t r t   r 2   6 9 r 2   6 r , k 3 3 r k . ;   ! 1 r t * ' r e ! ' d k s ' . t r   ; r t * ' r c n   h ' r 1   d t r d ' o ! ' t 3 2 r c d r d   . r c d r t * ' r   6 t n ! ' c z r   ; r , c ! r , k t * r t * ' r 6 . k t ' s r d t c t ' d r   ; r c 1 ' ! k o c r k d r o ' ! t c k . r c . s r c s s r t * ' r d 6 ( ( ' d t k   . r t * c t r * ' r d *   6 3 s 8 r   . r * k d r   , . r k . k t k c t k h ' 8 r k . h k t ' r ) c e c . r t   r k 1 1 ' s k c t ' r c s * ' ! ' . o ' r c . s r c t r t * ' r d c 1 ' r t k 1 ' r 1 ' s k c t ' r n ' t , ' ' . r ) c e c . r c . s r   6 ! d ' 3 h ' d 9 r e 3 ' c d ' r o c 3 3 r t * ' r e ! ' d k s ' . t b d r c t t ' . t k   . r t   r t * ' r ; c o t r t * c t r t * ' r ! 6 t * 3 ' d d r ' 1 e 3   2 1 ' . t r   ; r   6 ! r d 6 n 1 c ! k . ' d r .   , r   ; ; ' ! d r t * ' r e !   d e ' o t r   ; r o   1 e ' 3 3 k . ( r ' . ( 3 c . s r k . r c r ; ' , r 1   . t * d r t   r 1 c z ' r e ' c o ' 9
> Decode(fixed.trans1, dictionary, message, iter = 100000) -> test2
w e   i n t e n g   t o   m e k i n   o n   t h e   f i r s t   o f   f e m r u a r y  
3.082826e+263
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  
2.657652e+267
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  
7.711922e+267
w e   i n t e n d   t o   b e g i n   o n   t h e   f i r s t   o f   f e b r u a r y  
8.556139e+268
Chain Converged 

It doesn’t always work like you would hope, but here is the original message, followed by our decoded version. Note that in the final version, the algorithm decided that “d” and “g” should be switched.

 cat(message, fill = 88)
, ' r k . t ' . s r t   r n ' ( k . r   . r t * ' r ; k ! d t r   ; r ; ' n ! 6 c ! 2 r 
6 . ! ' d t ! k o t ' s r d 6 n 1 c ! k . ' r , c ! ; c ! ' 9 r , ' r d * c 3 3 r ' . s 
' c h   ! r k . r d e k t ' r   ; r t * k d r t   r z ' ' e r t * ' r 6 . k t ' s r d t 
c t ' d r   ; r c 1 ' ! k o c r . ' 6 t ! c 3 9 r k . r t * ' r ' h ' . t r   ; r t * k 
d r .   t r d 6 o o ' ' s k . ( 8 r , ' r 1 c z ' r 1 ' / k o   r c r e !   e   d c 3 r 
  ; r c 3 3 k c . o ' r   . r t * ' r ;   3 3   , k . ( r n c d k d l r 1 c z ' r , c ! 
r t   ( ' t * ' ! 8 r 1 c z ' r e ' c o ' r t   ( ' t * ' ! 8 r ( ' . ' !   6 d r ; k . 
c . o k c 3 r d 6 e e   ! t r c . s r c . r 6 . s ' ! d t c . s k . ( r   . r   6 ! r e 
c ! t r t * c t r 1 ' / k o   r k d r t   r ! ' o   . m 6 ' ! r t * ' r 3   d t r t ' ! 
! k t   ! 2 r k . r t ' / c d 8 r . ' , r 1 ' / k o   8 r c . s r c ! k w   . c 9 r t * 
' r d ' t t 3 ' 1 ' . t r k . r s ' t c k 3 r k d r 3 ' ; t r t   r 2   6 9 r 2   6 r , 
k 3 3 r k . ;   ! 1 r t * ' r e ! ' d k s ' . t r   ; r t * ' r c n   h ' r 1   d t r d 
' o ! ' t 3 2 r c d r d   . r c d r t * ' r   6 t n ! ' c z r   ; r , c ! r , k t * r t 
* ' r 6 . k t ' s r d t c t ' d r   ; r c 1 ' ! k o c r k d r o ' ! t c k . r c . s r c 
s s r t * ' r d 6 ( ( ' d t k   . r t * c t r * ' r d *   6 3 s 8 r   . r * k d r   , . 
r k . k t k c t k h ' 8 r k . h k t ' r ) c e c . r t   r k 1 1 ' s k c t ' r c s * ' ! 
' . o ' r c . s r c t r t * ' r d c 1 ' r t k 1 ' r 1 ' s k c t ' r n ' t , ' ' . r ) c 
e c . r c . s r   6 ! d ' 3 h ' d 9 r e 3 ' c d ' r o c 3 3 r t * ' r e ! ' d k s ' . t 
b d r c t t ' . t k   . r t   r t * ' r ; c o t r t * c t r t * ' r ! 6 t * 3 ' d d r ' 
1 e 3   2 1 ' . t r   ; r   6 ! r d 6 n 1 c ! k . ' d r .   , r   ; ; ' ! d r t * ' r e 
!   d e ' o t r   ; r o   1 e ' 3 3 k . ( r ' . ( 3 c . s r k . r c r ; ' , r 1   . t * 
d r t   r 1 c z ' r e ' c o ' 9
> cat(test2[[1]], fill = 88)
w e   i n t e n g   t o   b e d i n   o n   t h e   f i r s t   o f   f e b r u a r y   
u n r e s t r i c t e g   s u b m a r i n e   w a r f a r e ,   w e   s h a l l   e n g 
e a v o r   i n   s p i t e   o f   t h i s   t o   k e e p   t h e   u n i t e g   s t 
a t e s   o f   a m e r i c a   n e u t r a l ,   i n   t h e   e v e n t   o f   t h i 
s   n o t   s u c c e e g i n d .   w e   m a k e   m e x i c o   a   p r o p o s a l   
o f   a l l i a n c e   o n   t h e   f o l l o w i n d   b a s i s q   m a k e   w a r 
  t o d e t h e r .   m a k e   p e a c e   t o d e t h e r .   d e n e r o u s   f i n 
a n c i a l   s u p p o r t   a n g   a n   u n g e r s t a n g i n d   o n   o u r   p 
a r t   t h a t   m e x i c o   i s   t o   r e c o n j u e r   t h e   l o s t   t e r 
r i t o r y   i n   t e x a s .   n e w   m e x i c o .   a n g   a r i ! o n a ,   t h 
e   s e t t l e m e n t   i n   g e t a i l   i s   l e f t   t o   y o u ,   y o u   w 
i l l   i n f o r m   t h e   p r e s i g e n t   o f   t h e   a b o v e   m o s t   s 
e c r e t l y   a s   s o n   a s   t h e   o u t b r e a k   o f   w a r   w i t h   t 
h e   u n i t e g   s t a t e s   o f   a m e r i c a   i s   c e r t a i n   a n g   a 
g g   t h e   s u d d e s t i o n   t h a t   h e   s h o u l g .   o n   h i s   o w n 
  i n i t i a t i v e .   i n v i t e   ( a p a n   t o   i m m e g i a t e   a g h e r 
e n c e   a n g   a t   t h e   s a m e   t i m e   m e g i a t e   b e t w e e n   ( a 
p a n   a n g   o u r s e l v e s ,   p l e a s e   c a l l   t h e   p r e s i g e n t 
" s   a t t e n t i o n   t o   t h e   f a c t   t h a t   t h e   r u t h l e s s   e 
m p l o y m e n t   o f   o u r   s u b m a r i n e s   n o w   o f f e r s   t h e   p 
r o s p e c t   o f   c o m p e l l i n d   e n d l a n g   i n   a   f e w   m o n t h 
s   t o   m a k e   p e a c e ,

A little Googling will tell us that this is “The Zimmerman Telegram.”

Although this is a simple type of cipher that could be done quite easily with simple frequency analysis of the characters, the method is an interesting one.

References:
Diaconis, Persi; “The Markov Chain Monte Carlo Revolution”