# A Sequence Clustering Model in R

I’ve just released my first R package!
Over the past year and a half, I’ve been studying an obscure statistical model over ranking data (full or partial) called Mallows’ model. It hypothesizes that a set of sequence data has a “modal” sequence about which the data cluster, and that the probability of a sequence falls off as $\exp(-\lambda d)$, where $d$ is its distance from the modal sequence and $\lambda$ is estimated from the data.
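Written out in full, the single-mode model takes the following standard form (here $\pi_0$ is the modal sequence, $d$ is the distance discussed below, and $C(\lambda)$ is a normalizing constant):

$$P(\pi \mid \pi_0, \lambda) = \frac{\exp\left(-\lambda \, d(\pi, \pi_0)\right)}{C(\lambda)}$$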

There is a closed-form solution for determining the $\lambda$ parameter and the modal sequence, as outlined in the third reference in the package manual. However, we often have heterogeneous data, for which it makes more sense to hypothesize multiple modal sequences. That is where this package comes in.

The RMallow package can take any set of sequence data of reasonable size (I have gotten it to process data sets with 2000+ rows and 20+ attributes in under a minute) and fit a multi-modal Mallows’ model to the data using an EM algorithm, as described in the first reference in the package summary.

Certain conditions have to be met for the model to fit well. For example, if there are multiple modal sequences but they are very close together (as measured by Kendall’s metric, the number of adjacent transpositions needed to turn one sequence into the other, which is exactly the number of swaps Bubble Sort would perform), Mallows’ model may simply group them all into the same “cluster.” Here is an example where I fit the model to some crude synthetic data and recover two constructed modal sequences. The sequences are no more than distance 20 apart (the maximum possible distance in this space is 20*(20-1)/2 = 190), and the model still groups the data very well!
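To make the distance concrete, here is a minimal sketch of Kendall’s metric on permutations (my own illustrative function, not part of RMallow): it counts the pairs of items that the two sequences order differently, which equals the number of adjacent swaps Bubble Sort would need to sort one ordering into the other.

```r
# Kendall's metric between two permutations of the same items:
# the number of item pairs that a and b place in opposite orders.
KendallDistance <- function(a, b) {
  n <- length(a)
  d <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # a puts a[i] before a[j]; does b reverse that pair?
      if (match(a[i], b) > match(a[j], b)) {
        d <- d + 1
      }
    }
  }
  d
}

KendallDistance(1:4, c(2, 1, 3, 4))  # one discordant pair, so distance 1
```

On sequences of length 20 this distance ranges from 0 (identical orderings) to 20*(20-1)/2 = 190 (exact reversals), which is where the maximum quoted above comes from.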

The idea behind the construction of the data below is to create two modal sequences for the code to find. I construct a good number of copies of each modal sequence and randomly transpose between 1 and 10 of the elements of each (uniformly, which I stress is NOT how Mallows’ model works…). I then add a good deal of noise in the form of partial rankings (rankings in which certain groups of items are not fully ordered relative to one another). The final call fits the model.

# Some synthetic data to demonstrate the capability of RMallow
library(RMallow)
modal <- 1:20
second.modal <- c(1, 2, 3, 4, 5, 17, 7, 9, 8, 11, 10,
                  12, 13, 14, 15, 16, 6, 18, 19, 20)
partial.noise <- list()
modal.noise <- list()
second.noise <- list()
for (i in 1:300) {
  # Partial ranking noise
  partial.noise[[i]] <- sample(20, replace = TRUE)
  # Modal sequence 1
  modal.noise[[i]] <- modal
  # Modal sequence 2
  second.noise[[i]] <- second.modal
  # Choose between 1 and 10 random indices to swap
  swit <- sample(20, sample(10, 1))
  swit.2 <- sample(20, sample(10, 1))
  modal.noise[[i]][swit] <- modal.noise[[i]][rev(swit)]
  second.noise[[i]][swit.2] <- second.modal[rev(swit.2)]
}
partial.noise <- do.call("rbind", partial.noise)
partial.noise <- SimplifySequences(partial.noise)
modal.noise <- do.call("rbind", modal.noise)
second.noise <- do.call("rbind", second.noise)
datas <- do.call("rbind", list(modal.noise, partial.noise, second.noise))
datas <- datas[sample(nrow(datas)), ]
test <- Mallows(datas, iter = 100, G = 2)


Now for the moment of truth: did the code find the sequences? The output is a list: the first element is the modal sequences found, the second is the proportion of data assigned to each cluster, the third is the cluster membership probabilities for each row, and the fourth is a data frame with the data, the probabilities of cluster membership, the cluster assignments, and the distance from each modal sequence.

> test[[1]]
[[1]]
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

[[2]]
[1]  1  2  3  4  5 17  7  9  8 11 10 12 13 14 15 16  6 18 19 20


GREAT SUCCESS! I purposely did not set a seed here because, when I ran the code above, it actually did not produce what I was hoping for. I DID get that output from a function I will include in the next release, which runs the algorithm several times and selects the best fit based on the log-likelihood of the data under the model. EM algorithms are prone to local maxima, so this is a necessary fix. You may very well get the output above when you run that code on your machine, depending on the state of your random number generator. If not, try running it 5 times or so and it will most likely work out!
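Until that function is released, the restart idea is easy to sketch generically. This is my own illustration, not the package’s upcoming function: `FitFun` and `ScoreFun` are placeholder arguments standing in for the model fit (e.g. a call to `Mallows`) and however its log-likelihood is extracted.

```r
# Run a fitting procedure several times and keep the highest-scoring fit,
# mitigating the EM algorithm's tendency to stop at local maxima.
# FitFun:   a no-argument function returning a fitted model.
# ScoreFun: a function mapping a fitted model to its log-likelihood.
BestOfRestarts <- function(FitFun, ScoreFun, n.starts = 5) {
  best <- NULL
  best.score <- -Inf
  for (s in seq_len(n.starts)) {
    fit <- FitFun()
    score <- ScoreFun(fit)
    if (score > best.score) {
      best.score <- score
      best <- fit
    }
  }
  best
}
```

With RMallow, one might call `BestOfRestarts(function() Mallows(datas, iter = 100, G = 2), ScoreFun)` with a suitable log-likelihood extractor for `ScoreFun`.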

So there we have it. I have demonstrated that this package has some value in mining patterns from noisy sequence data. I will be working on implementing some solutions to increase the probability of convergence, and on extending the model to larger sequence spaces. I also plan to provide a goodness-of-fit assessment of the model when the data satisfy certain conditions (sample size, …). Finally, I hope to demonstrate the value of this model on real-world data in my next post.

# A spell-checker in R

I came across Dr. Peter Norvig’s blog post about writing a basic spell-checker (http://norvig.com/spell-correct.html) and just had to try implementing it in R. Please excuse the somewhat ugly code (I have not optimized or commented it adequately at this point, but you can get the idea of what it does by reading Dr. Norvig’s post). If anyone knows of any pre-built spell-checker packages in R, please let me know in a comment!

I do not think R is a particularly good language for this sort of task, but I got it to work fine. The first few lines below create a table of common words and their frequencies in English. They may take a few minutes to run on an average machine, but I will try to upload the resulting table soon so that you can download it instead of building it yourself…

# Build a word-frequency table from Norvig's corpus.
# (The strip.text helper is defined below.)
words <- scan("http://norvig.com/big.txt", what = character())
words <- strip.text(words)
counts <- table(words)


Next, here are the functions we need to do the spell-check operations…

# A text-processing function, borrowed from a CMU
# data mining course professor.
strip.text <- function(txt) {
  # remove apostrophes (so "don't" -> "dont", "Jane's" -> "Janes", etc.)
  txt <- gsub("'", "", txt)
  # convert to lowercase
  txt <- tolower(txt)
  # change other non-alphanumeric characters to spaces
  txt <- gsub("[^a-z0-9]", " ", txt)
  # remove digits
  txt <- gsub("[0-9]+", " ", txt)
  # split and make one vector
  txt <- unlist(strsplit(txt, " "))
  # remove empty words
  txt <- txt[txt != ""]
  return(txt)
}

# Words within 1 transposition.
Transpositions <- function(word) {
  N <- nchar(word)
  if (N > 2) {
    out <- rep(word, N - 1)
    word <- unlist(strsplit(word, NULL))
    # Adjacent index pairs to swap
    perms <- matrix(c(1:(N - 1), 2:N), ncol = 2)
    reversed <- perms[, 2:1]
    trans.words <- matrix(rep(word, N - 1), byrow = TRUE, nrow = N - 1)
    for (i in 1:(N - 1)) {
      trans.words[i, perms[i, ]] <- trans.words[i, reversed[i, ]]
      out[i] <- paste(trans.words[i, ], collapse = "")
    }
  } else if (N == 2) {
    # Split before reversing; indexing the unsplit string with
    # word[2:1] would produce an NA.
    out <- paste(rev(unlist(strsplit(word, NULL))), collapse = "")
  } else {
    out <- word
  }
  return(out)
}

# Single-letter deletions.
# Thanks to luiscarlosmr for a partial correction in the comments.
Deletes <- function(word) {
  N <- nchar(word)
  out <- character(N)
  word <- unlist(strsplit(word, NULL))
  for (i in 1:N) {
    out[i] <- paste(word[-i], collapse = "")
  }
  return(out)
}

# Single-letter insertions.
Insertions <- function(word) {
  N <- nchar(word)
  out <- list()
  for (letter in letters) {
    out[[letter]] <- rep(word, N + 1)
    for (i in 1:(N + 1)) {
      # Insert the letter before position i.
      out[[letter]][i] <- paste(substr(word, i - N, i - 1), letter,
                                substr(word, i, N), sep = "")
    }
  }
  out <- unlist(out)
  return(out)
}

# Single-letter replacements.
Replaces <- function(word) {
  N <- nchar(word)
  out <- list()
  for (letter in letters) {
    out[[letter]] <- rep(word, N)
    for (i in 1:N) {
      # Replace the letter at position i.
      out[[letter]][i] <- paste(substr(word, i - N, i - 1), letter,
                                substr(word, i + 1, N + 1), sep = "")
    }
  }
  out <- unlist(out)
  return(out)
}
# All neighbors within distance "1".
Neighbors <- function(word) {
  neighbors <- c(word, Replaces(word), Deletes(word),
                 Insertions(word), Transpositions(word))
  return(neighbors)
}

# Probability of a word, as determined by our corpus.
Probability <- function(word, dtm) {
  # Total number of words in the corpus
  N <- sum(dtm)
  word.number <- which(names(dtm) == word)
  count <- dtm[word.number]
  pval <- count / N
  return(pval)
}

# Correct a single word.
Correct <- function(word, dtm) {
  neighbors <- Neighbors(word)
  # If it is already a known word, just return it.
  if (word %in% names(dtm)) {
    out <- word
  } else {
    # Otherwise, check for neighbors.
    # Which of the neighbors are known words?
    known <- which(neighbors %in% names(dtm))
    N.known <- length(known)
    # If there are no known neighbors, including the word,
    # look farther away (neighbors of neighbors).
    if (N.known == 0) {
      print(paste("Having a hard time matching '", word, "'...", sep = ""))
      neighbors <- unlist(lapply(neighbors, Neighbors))
    }
    # Then filter out non-words.
    neighbors <- neighbors[which(neighbors %in% names(dtm))]
    N <- length(neighbors)
    # If we found some known neighbors, return the most probable one.
    if (N >= 1) {
      P <- numeric(N)
      for (i in 1:N) {
        P[i] <- Probability(neighbors[i], dtm)
      }
      out <- neighbors[which.max(P)]
    } else {
      # If still no known neighbors, return the word unchanged.
      out <- word
    }
  }
  return(out)
}

# Correct an entire document.
CorrectDocument <- function(document, dtm) {
  by.word <- unlist(strsplit(document, " "))
  N <- length(by.word)
  for (i in 1:N) {
    by.word[i] <- Correct(by.word[i], dtm = dtm)
  }
  corrected <- paste(by.word, collapse = " ")
  return(corrected)
}


The above functions generate “neighbors” of words, determine the probabilities of those neighbors, and return the best ones. “CorrectDocument” corrects an entire document (with special characters and punctuation removed), and “Correct” corrects a single word. Here are some sample runs.

> Correct("speling", dtm = counts)
l4
"spelling"
> Correct("korrecter", dtm = counts)
[1] "Having a hard time matching 'korrecter'..."
c1.d9
"corrected"
> CorrectDocument("the quick bruwn fowx jumpt ovre tha lasy dog", dtm = counts)
[1] "the quick brown fox jump over the last dog"


As you can see, this function is obviously not perfect. It does handle some basic corrections automatically, but there are still improvements to be made. More to come!