Tag Archives: dplyr

Database Reflection using dplyr

At work I write a ton of SQL, and I do most of my querying using R.  The workflow goes:

  1. Create a string with the SQL in R
  2. Plug the string into fetchQuery (see my previous post)

This solution works relatively well, but i’m a bit unhappy writing strings rather than using function calls.

I began working on my own ORM implementation, but it was very slow-go and it would have taken a lot of time to get anywhere.  Luckily, I was pleasantly surprised that Hadley Wickham’s new dplyr package implements much of the ORM I was hoping for.

One thing I want out of an ORM is the ability to see every table in our databases all at once.  That functionality, while implement-able using the dplyr package, would likely take quite a while on tens of thousands of tables.  So I decided to implement the reflection myself.

 

 

#' Get the table information for a postgres database
#' @param config the configuration list
#' @return the table names, columns, and column types of all columns in the database
getTableInformation <- function(config = config.gp) {
  tables <- fetchQuery(
    "SELECT table_name, column_name, data_type 
    FROM information_schema.columns 
    WHERE table_name NOT LIKE '%prt%' 
      AND table_name NOT LIKE '%ext%' 
      AND table_name NOT LIKE '%tmp%' 
    ORDER BY 1, 2",
    config
  )
}

#' Replacement of the normal update function, you don't need to call this.
update <- function(object, ...) {
    args <- list(...)
    for (nm in names(args)) {
        object[[nm]] <- args[[nm]]
    }
    if (is.null(object$select)) {
        if (is.ident(object$from)) {
            var_names <- object$select
        }
        else {
            var_names <- qry_fields(object$src$con, object$from)
        }
        vars <- lapply(var_names, as.name)
        object$select <- vars
    }
    object$query <- dplyr:::build_query(object)
    object
}
#' Function to reflect a database, generalizable to others beyond postgres 
#' by simply changing getTableInformation appropriately
reflectDatabase <- function(config, envir.name = "tables",
                            subclass = "postgres") {
  if (!(envir.name %in% search())) {
    envir <- new.env(parent = .GlobalEnv)
  } else {
    envir <- as.environment(envir.name)
    detach(envir.name, character.only = TRUE)
  }
  src <- do.call(src_postgres, config)
  tables <- getTableInformation(config)
  tables <- split(tables, tables$table_name)
  lapply(tables, function(i) {
    nm <- ident(i$table_name[1])
    vars <- lapply(i$column_name, as.name)
    tbl <- dplyr::make_tbl(c(subclass, "sql"), src = src, from = nm,
                    select = vars, summarise = FALSE, mutate = FALSE,
                    where = NULL, group_by = NULL, order_by = NULL)
    tbl <- update(tbl)
    assign(
      nm,
      tbl,
      envir = envir
    )
  })
  attach(envir, name = envir.name)
}

searchTables <- function(str, env = "tables") {
  all.tbls <- ls(env)
  all.tbls[grep(str, all.tbls)]
}

To use this function, you can simply call (with your config, as specified in my last post, included)

reflectDatabase(config)

and if you’re using a Postgres database, that should be it!

The fun part now, is that I can do things like

res <- inner_join(my_table_1, my_table_2)

where my_table_1 and my_table_2 are simply names of tables in my database. This provides me with auto-complete of table names, search-able table names and columns, etc.

For example:

searchTables('user')

returns all tables in our database with the string “user” in them.

These are some things I hope to see or find in dplyr, and may try to build myself if they don’t already exist:
1. Case statements in mutate
2. Creating table indexes
3. type checking of columns, and more informative error messages when un-sensible joins and filters are performed.

Overall this package seems like a lot of fun, and i’m excited to try to work it into my coding!