DocumentTermMatrix error on Corpus argument

I have the following code:

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.

corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)

news_dtm <- DocumentTermMatrix(corpus_clean) # errors here

When I call DocumentTermMatrix(), it gives me this error:

Error: inherits(doc, "TextDocument") is not TRUE

Why am I getting this error? Are my rows not text documents?

Here is the output when I inspect corpus_clean:

[[153]]
[1] obama holds technical school model us

[[154]]
[1] oil boom produces jobs bonanza archaeologists

[[155]]
[1] islamic terrorist group expands territory captures tikrit

[[156]]
[1] republicans democrats feel eric cantors loss

[[157]]
[1] tea party candidates try build cantor loss

[[158]]
[1] vehicles materials stored delaware bridges

[[159]]
[1] hill testimony hagel defends bergdahl trade

[[160]]
[1] tweet selfpropagates tweetdeck

[[161]]
[1] blackwater guards face trial iraq shootings

[[162]]
[1] calif man among soldiers killed afghanistan

[[163]]
[1] stocks fall back world bank cuts growth outlook

[[164]]
[1] jabhat alnusra longer useful turkey

[[165]]
[1] catholic bishops keep focus abortion marriage

[[166]]
[1] barbra streisand visits hill heart disease

[[167]]
[1] rand paul cantors loss reason stop talking immigration

[[168]]
[1] israeli airstrike kills northern gaza

Edit: Here is my data:

type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who your hero? Nominate them
neutral,Anderson Cooper: Here how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World most controversial foods

Which I read in with:

news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)

Edit: Here is the traceback():

> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), 
       ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)

When I evaluate inherits(corpus_clean, "TextDocument"), it is FALSE.

Answer 1

It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 appear to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). Instead they return character vectors, and DocumentTermMatrix doesn't know how to handle a corpus of character vectors.

So you can either change to

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Or you can run

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should ensure all of your data is in PlainTextDocument form and should make DocumentTermMatrix happy.
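As a minimal sketch of the first option applied to the whole pipeline from the question: the two functions that do not come from tm (tolower and trim) are wrapped in content_transformer, and the rest are left as they were. The `texts` vector is a hypothetical stand-in for news_raw$text, and VCorpus is called explicitly on the assumption that it matches the TermDocumentMatrix.VCorpus frame in the traceback.

```r
library(tm)

# Hypothetical stand-in for news_raw$text (three headlines from the question's data)
texts <- c("The week in 32 photos",
           "Look at me! 22 selfies of the week",
           "Inside rebel tunnels in Homs")

# returns string w/o leading or trailing whitespace
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

news_corpus <- VCorpus(VectorSource(texts))

# Wrap functions that are not tm transformations so each document stays a TextDocument
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, content_transformer(trim))

# Check each document, not the corpus itself, for the class termFreq expects
docs_ok <- all(vapply(content(corpus_clean), inherits, logical(1),
                      what = "TextDocument"))

news_dtm <- DocumentTermMatrix(corpus_clean)
dim(news_dtm)
```

Note that the check is made on the individual documents: the traceback shows the inherits() test running inside termFreq on each element of content(x), so inherits(corpus_clean, "TextDocument") being FALSE for the corpus as a whole is expected either way.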

Answer 2

I found a way to solve this problem in an article about tm.

Here is an example that produces the message shown after the code:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1") # import files
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation);
matrix_terms <- DocumentTermMatrix(corpus)

Warning message:

In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

This happens because the function needs a corpus built from a VectorSource to create your document-term matrix, but the earlier transformations turn the texts in your corpus into plain character vectors, changing their class to one the function does not accept.

However, if you wrap each function in content_transformer inside the tm_map call, you may not need an additional command before calling TermDocumentMatrix.

The code below changes the class (see the second-to-last line) and avoids the error:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1")
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- tm_map(corpus,content_transformer(stripWhitespace))
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- Corpus(VectorSource(corpus)) # change class 
matrix_term <- DocumentTermMatrix(corpus)

Answer 3

Change this:

corpus_clean <- tm_map(news_corpus, tolower)

To this:

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
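To see why this one-line change matters, here is a small check on a single document. It is only a sketch: the sample headline and the variable names `doc`, `plain`, and `wrapped` are made up for illustration, and it assumes tm 0.6.0 or later, where content_transformer is available.

```r
library(tm)

# Hypothetical single document, built directly for illustration
doc <- PlainTextDocument("Obama Holds Technical School Model")

# Base tolower() applied directly: the document is coerced to a character vector
plain <- tolower(doc)

# The same function wrapped: only the content is changed, the document is kept
wrapped <- content_transformer(tolower)(doc)

class(plain)
inherits(plain, "TextDocument")
inherits(wrapped, "TextDocument")
content(wrapped)
```

The unwrapped call loses the document class, which is the condition the stopifnot() in the traceback complains about; the wrapped call keeps it.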

Answer 4

This should work.

remove.packages("tm")
install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip",repos=NULL)
library(tm)
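Before downgrading, it may be worth checking which version of tm is actually installed, since the other answers suggest the behaviour changed in 0.6.0. A minimal sketch; the variable `needs_wrapper` is a hypothetical name used only here:

```r
# NA means tm is not installed, so nothing can be said about the version
needs_wrapper <- NA

if (requireNamespace("tm", quietly = TRUE)) {
  v <- packageVersion("tm")
  message("Installed tm version: ", as.character(v))
  # From 0.6.0 on, custom functions passed to tm_map are expected to need content_transformer()
  needs_wrapper <- v >= "0.6"
}

needs_wrapper
```

If this reports 0.6 or later, wrapping the custom functions as described in the other answers is likely a lighter fix than pinning an old release.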