A spell-checker in R

I came across Dr. Peter Norvig’s blog about writing a basic spell-checker (http://norvig.com/spell-correct.html), and just had to try to implement it in R. Please excuse the ugly-ish code (I have not optimized it or commented it adequately at this point, but you can get the idea of what it does by reading Dr. Norvig’s blog). If anyone knows of any pre-built spell-checker packages in R, please let me know in a comment!

I do not think R is a particularly good language for this sort of activity, but I got it to work out fine. The first few lines here create a list of common words, and their frequencies in the English language. The following lines may take a few minutes to run on an average machine, but I will try to upload them soon so that you can just download the table instead of creating it yourself…

words <- scan("http://norvig.com/big.txt", what = character())
words <- strip.text(words)
counts <- table(words)

Next, here are the functions we need to do the spell-check operations…

# This is a text processing function, which I
# borrowed from a CMU Data mining course professor.
strip.text <- function(txt) {
  # remove apostrophes (so "don't" -> "dont", "Jane's" -> "Janes", etc.)
  txt <- gsub("'","",txt)
  # convert to lowercase
  txt <- tolower(txt)
  # change other non-alphanumeric characters to spaces
  txt <- gsub("[^a-z0-9]"," ",txt)
  # change digits to #
  txt <- gsub("[0-9]+"," ",txt)
  # split and make one vector
  txt <- unlist(strsplit(txt," "))
  # remove empty words
  txt <- txt[txt != ""]
  return(txt)
}

# Words within 1 transposition.
Transpositions <- function(word = FALSE) {
  N <- nchar(word)
  if (N > 2) {
    out <- rep(word, N - 1)
    word <- unlist(strsplit(word, NULL))
    # Permutations of the letters
    perms <- matrix(c(1:(N - 1), 2:N), ncol = 2)
    reversed <- perms[, 2:1]
    trans.words <- matrix(rep(word, N - 1), byrow = TRUE, nrow = N - 1)
    for(i in 1:(N - 1)) {
      trans.words[i, perms[i, ]] <- trans.words[i, reversed[i, ]]
      out[i] <- paste(trans.words[i, ], collapse = "")
    }
  }
  else if (N == 2) {
    out <- paste(word[2:1], collapse = "")
  }
  else {
    out <- paste(word, collapse = "")
  }
  return(out)
}

# Single letter deletions.
# Thanks to luiscarlosmr for partial correction in comments
Deletes <- function(word = FALSE) {
  N <- nchar(word)
  out<-mat.or.vec(1,N)
  word <- unlist(strsplit(word, NULL))
  for(i in 1:N) {
    out[i] <- paste(word[-i], collapse = "")
  }
  return(out)
}

# Single-letter insertions.
Insertions <- function(word = FALSE) {
  N <- nchar(word) 
  out <- list()
  for (letter in letters) {
    out[[letter]] <- rep(word, N + 1)
    for (i in 1:(N + 1)) {
      out[[letter]][i] <- paste(substr(word, i - N, i - 1), letter, 
                                substr(word, i, N), sep = "")
    }
  }
  out <- unlist(out)
  return(out)
}

# Single-letter replacements.
Replaces <- function(word = FALSE) {
  N <- nchar(word) 
  out <- list()
  for (letter in letters) {
    out[[letter]] <- rep(word, N)
    for (i in 1:N) {
      out[[letter]][i] <- paste(substr(word, i - N, i - 1), letter, 
                                substr(word, i + 1, N + 1), sep = "")
    }
  }
  out <- unlist(out)
  return(out)
}
# All Neighbors with distance "1"
Neighbors <- function(word) {
  neighbors <- c(word, Replaces(word), Deletes(word),
                 Insertions(word), Transpositions(word))
  return(neighbors)
}

# Probability as determined by our corpus.
Probability <- function(word, dtm) {
  # Number of words, total
  N <- length(dtm)
  word.number <- which(names(dtm) == word)
  count <- dtm[word.number]
  pval <- count/N
  return(pval)
}

# Correct a single word.
Correct <- function(word, dtm) {
  neighbors <- Neighbors(word)
  # If it is a word, just return it.
  if (word %in% names(dtm)) {
    out <- word
  }
  # Otherwise, check for neighbors.
  else {
    # Which of the neighbors are known words?
    known <- which(neighbors %in% names(dtm))
    N.known <- length(known)
    # If there are no known neighbors, including the word,
    # look farther away.
    if (N.known == 0) {
      print(paste("Having a hard time matching '", word, "'...", sep = ""))
      neighbors <- unlist(lapply(neighbors, Neighbors))
    }
    # Then out non-words.
    neighbors <- neighbors[which(neighbors %in% names(dtm))]
    N <- length(neighbors)
    # If we found some neighbors, find the one with the highest
    # p-value.
    if (N >= 1) {
      P <- 0*(1:N)
      for (i in 1:N) {
        P[i] <- Probability(neighbors[i], dtm)
      }
      out <- neighbors[which.max(P)]
    }
    # If no neighbors still, return the word.
    else {
      out <- word
    }
  }
  return(out)
}

# Correct an entire document.
CorrectDocument <- function(document, dtm) {
  by.word <- unlist(strsplit(document, " "))
  N <- length(by.word)
  for (i in 1:N) {
    by.word[i] <- Correct(by.word[i], dtm = dtm)
  }
  corrected <- paste(by.word, collapse = " ")
  return(corrected)
}

The above functions generate “neighbors” of words, determine probabilities of the neighbors, and return the best ones. Function “CorrectDocument” will correct an entire document (with special characters and punctuation removed), and “Correct” will simply correct a word. Here are some sample runs.

> Correct("speling", dtm = counts)
        l4 
"spelling" 
> Correct("korrecter", dtm = counts)
[1] "Having a hard time matching 'korrecter'..."
      c1.d9 
"corrected" 
> CorrectDocument("the quick bruwn fowx jumpt ovre tha lasy dog", dtm = counts)
[1] "the quick brown fox jump over the last dog"

As you can see, this function is obviously not perfect. It will do some basic corrections automatically though, but there are some improvements to be made. More to come!

Advertisements

8 thoughts on “A spell-checker in R

    1. RGuy Post author

      Yes, I had come across that in my search. I didn’t think much about it since you cannot directly supply it with a string and have it corrected, but it could easily perform the same function (and better, I’m sure) if I just output the string to a file and then read it in using aspell. I will likely do the latter if I need any serious spell-correction done with R, but learning the idea behind very basic spell-checking was interesting!

      Reply
  1. luiscarlosmr

    Kindly, let me add something about this code:

    # Deletes
    # TRY: word[-i]
    Deletes <- function(word = FALSE) {
    N <- nchar(word)
    out<-mat.or.vec(1,length(N))
    word <- unlist(strsplit(word, NULL))
    for(i in 1:length(N)) {
    out[i] <- paste(word[-i], collapse = "")
    }
    return(out)
    }

    As you can see I just added a line. Have a great time and this code is really interesting to me. Thank you.

    Reply
  2. Deepak

    Actually, a small typo in the final module, you need to change
    by.word <- unlist(strsplit(essay, " "))
    to by.word <- unlist(strsplit(document, " "))……………….//////the full version of the module is

    # Correct an entire document.
    CorrectDocument <- function(document, dtm) {
    by.word <- unlist(strsplit(document, " "))
    N <- length(by.word)
    for (i in 1:N) {
    by.word[i] <- Correct(by.word[i], dtm = dtm)
    }
    corrected <- paste(by.word, collapse = " ")
    return(corrected)
    }

    Reply
      1. Dan

        Awesome! I’d love to use it in a project! Unfortunately, “no license associated with it” doesn’t exactly mean “free to use” because it’s copyrighted by default. Do you mean that you release it into the public domain? Thanks again, this was an awesome lesson, and I’d love to be able to adapt it for a need of mine.

  3. Ankit

    Hello, I am working on a project where I need to build a spell checker and that’s how I stumbled onto your blog post. But I’m facing an issue when I am trying to execute the code. R gives me the following error:

    Error in out[i] <- paste(word[-i], collapse = "") :
    object 'out' not found

    Do you have any idea what could be the reason for such an error?

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s