Dear fellow data scientists and DSDojo team!
I have been following the text analytics tutorial on the Data Science Dojo YouTube channel and have run into quite a problem at video #5: https://www.youtube.com/watch?v=az7yf0IfWPM&index=1&list=LLLZutrH5LE6q--0aUaeNNeQ
I have worked through the first five videos of the course on the spam dataset, and in parallel I am running the same code on my own, much larger dataset: about 1.6 million observations, each containing roughly the same amount of text as the spam observations.
When I created my train.tokens.dfm, I got a large dfm object of about 200 billion elements (420 MB):
dim(train.tokens.dfm)
[1] 1482535 138945
where 1,482,535 is the number of observations and 138,945 is the number of features.
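To put that in perspective, here is my own back-of-the-envelope estimate of what a fully dense version of this dfm would need:

# Rough size of the dense matrix that as.matrix() would have to allocate,
# based on the dimensions reported above:
n_docs  <- 1482535
n_feats <- 138945
n_cells <- n_docs * n_feats    # ~2.06e11 cells, i.e. the ~200 billion elements
n_cells * 8 / 1024^4           # ~1.5 TiB of 8-byte doubles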
I then proceeded to the next step, creating a dense matrix and a data frame with:
train.tokens.matrix = as.matrix(train.tokens.dfm)
or
train.tokens.df = cbind(Label = train$Label, as.data.frame(train.tokens.dfm))
so that I could later run the TF-IDF and SVD computations. However, I run into what I assume are memory problems: both calls return the following error:
Cholmod error ‘problem too large’ at file …/Core/cholmod_dense.c, line 105
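If I am reading this error right, it may not even be a pure RAM issue: the total number of cells seems to exceed what the Matrix/Cholmod dense format can index (this is a guess on my part, not something I have verified in the Cholmod source):

# My guess: the dense coercion fails because the cell count exceeds the
# integer index limit that Cholmod's dense format appears to use
1482535 * 138945 > .Machine$integer.max    # TRUE (~2.06e11 vs ~2.15e9)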
So a fully dense matrix is clearly off the table at this scale, but I really do want to keep tokenization in my approach.
Is there any way around creating a dense matrix or data frame so that I can still perform the TF-IDF and SVD calculations?
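In case it helps frame the question, this is the kind of fully sparse workflow I am hoping exists. It is only a sketch, assuming a recent quanteda with dfm_tfidf(), the irlba package, and that a dfm coerces cleanly to a dgCMatrix (please correct me if these are the wrong tools):

library(quanteda)
library(irlba)

# TF-IDF weighting applied directly to the sparse dfm, never densifying it
train.tokens.tfidf <- dfm_tfidf(train.tokens.dfm)

# Coerce to a plain sparse matrix (assumption: a dfm extends dgCMatrix)
# and take a truncated SVD, keeping only the top 300 singular vectors
# instead of all 138,945 columns
train.sparse <- as(train.tokens.tfidf, "dgCMatrix")
train.svd <- irlba(train.sparse, nv = 300, maxit = 600)

# Document-level projections for downstream modelling
train.lsa <- train.svd$u %*% diag(train.svd$d)

Would something along these lines be a sane direction, or is there a better way?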
Thank you in advance!