Dear fellow data scientists and DSDojo team!
I have been following the text analytics tutorial on the Data Science Dojo YouTube channel and have run into quite a problem at video #5: https://www.youtube.com/watch?v=az7yf0IfWPM&index=1&list=LLLZutrH5LE6q--0aUaeNNeQ
I have worked through the first five videos of the course on the spam dataset, and in parallel I am running the same code on my own, much larger dataset: about 1.6 million observations, each containing roughly the same amount of text as the spam observations.
When I created my train.tokens.dfm, I got a large dfm object of about 200 billion elements (420 MB):
dim(train.tokens.dfm)
[1] 1482535 138945
where 1,482,535 is the number of observations and 138,945 is the number of features.
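To put that in perspective, here is my own back-of-the-envelope estimate of what a fully dense version of this dfm would need:

# Rough size of the dense matrix that as.matrix() would have to allocate,
# based on the dimensions reported above:
n_docs  <- 1482535
n_feats <- 138945
n_cells <- n_docs * n_feats    # ~2.06e11 cells, i.e. the ~200 billion elements
n_cells * 8 / 1024^4           # ~1.5 TiB of 8-byte doubles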
I then proceeded to the next step, creating a dense matrix and a data frame with:
train.tokens.matrix = as.matrix(train.tokens.dfm)
or
train.tokens.df = cbind(Label = train$Label, as.data.frame(train.tokens.dfm))
so that I could later run the TF-IDF and SVD computations. However, I run into what I assume are memory problems: both calls return the following error:
Cholmod error ‘problem too large’ at file …/Core/cholmod_dense.c, line 105
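If I am reading this error right, it may not even be a pure RAM issue: the total number of cells seems to exceed what the Matrix/Cholmod dense format can index (this is a guess on my part, not something I have verified in the Cholmod source):

# My guess: the dense coercion fails because the cell count exceeds the
# integer index limit that Cholmod's dense format appears to use
1482535 * 138945 > .Machine$integer.max    # TRUE (~2.06e11 vs ~2.15e9)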
So a fully dense matrix is clearly off the table at this scale, but I really do want to keep tokenization in my approach.
Is there any way around creating a dense matrix or data frame so that I can still perform the TF-IDF and SVD calculations?
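In case it helps frame the question, this is the kind of fully sparse workflow I am hoping exists. It is only a sketch, assuming a recent quanteda with dfm_tfidf(), the irlba package, and that a dfm coerces cleanly to a dgCMatrix (please correct me if these are the wrong tools):

library(quanteda)
library(irlba)

# TF-IDF weighting applied directly to the sparse dfm, never densifying it
train.tokens.tfidf <- dfm_tfidf(train.tokens.dfm)

# Coerce to a plain sparse matrix (assumption: a dfm extends dgCMatrix)
# and take a truncated SVD, keeping only the top 300 singular vectors
# instead of all 138,945 columns
train.sparse <- as(train.tokens.tfidf, "dgCMatrix")
train.svd <- irlba(train.sparse, nv = 300, maxit = 600)

# Document-level projections for downstream modelling
train.lsa <- train.svd$u %*% diag(train.svd$d)

Would something along these lines be a sane direction, or is there a better way?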
Thank you in advance!