Dear fellow DS and DSDojo team!

I have been following the text analytics tutorial on the Data Science Dojo YouTube channel and have run into a problem at video #5: https://www.youtube.com/watch?v=az7yf0IfWPM&index=1&list=LLLZutrH5LE6q--0aUaeNNeQ

I've worked through the first 5 videos of the course on the spam dataset, and in parallel I am running the same code on my own dataset, which is considerably larger. It has about 1.6 million observations, each containing roughly the same amount of text as a spam observation.

When I created my train.tokens.dfm, I got a large dfm object of about 200 billion elements, taking 420 MB.

dim(train.tokens.dfm)

[1] 1482535 138945

where 1,482,535 is the number of observations and 138,945 is the number of features.

Then I proceeded to the next step to create a matrix and a dataframe with:

train.tokens.matrix = as.matrix(train.tokens.dfm)

or

train.tokens.df = cbind(Label = train$Label, as.data.frame(train.tokens.dfm))

so that later I could perform the TF-IDF and SVD steps, but I run into what I guess are memory problems.

Both of these calls return the following error:

Cholmod error ‘problem too large’ at file …/Core/cholmod_dense.c, line 105

I can imagine how large the dense matrix created by these calls must be, but I really do want to use tokenization in my approach.
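To put a number on it, here is a rough back-of-the-envelope estimate in base R. It assumes a dense matrix that stores every cell as an 8-byte double; the variable names are only for illustration.

```r
# Dimensions reported by dim(train.tokens.dfm)
n_docs  <- 1482535
n_feats <- 138945

# Total number of cells a dense matrix would have to hold
n_cells <- n_docs * n_feats

# Approximate size in GiB, assuming 8 bytes per cell (double precision)
dense_gib <- n_cells * 8 / 1024^3

# The cell count is also far beyond the largest 32-bit integer
exceeds_int <- n_cells > .Machine$integer.max

cat("cells:", format(n_cells, big.mark = ","), "\n")
cat("dense size (GiB):", round(dense_gib), "\n")
cat("more cells than .Machine$integer.max:", exceeds_int, "\n")
```

So a dense copy would need on the order of 1.5 TiB, compared with the 420 MB of the sparse dfm.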

Is there any way around creating a matrix or a data frame so that I can still perform the TF-IDF and SVD calculation?
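For concreteness, this is the kind of thing I have in mind: computing TF-IDF directly on a sparse matrix so that nothing is ever converted to dense. It is only a minimal sketch on a made-up toy matrix, assuming the Matrix package and one common TF-IDF definition (row-normalised term frequency times log10 of the inverse document frequency). I have not verified that it scales to my data.

```r
library(Matrix)

# Toy document-term counts (3 documents x 4 terms), stored sparsely.
# These numbers are invented purely for illustration.
counts <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 2, 2, 3, 4),
  x = c(2, 1, 3, 1, 1),
  dims = c(3, 4)
)

# Term frequency: divide each row by its total using a sparse diagonal
# matrix. Documents with no tokens would need to be dropped beforehand
# to avoid dividing by zero.
tf <- Diagonal(x = 1 / rowSums(counts)) %*% counts

# Inverse document frequency: log10(number of documents / number of
# documents that contain the term).
idf <- log10(nrow(counts) / colSums(counts > 0))

# TF-IDF: scale each column by its IDF. The result is still sparse.
tfidf <- tf %*% Diagonal(x = idf)

print(class(tfidf))
print(tfidf)
```

If something like this is valid, I assume the sparse result could then be handed to a truncated SVD routine that accepts sparse input (such as irlba, which the tutorial uses) instead of going through as.matrix() first.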

Thank you in advance!