Vectorizing text columns with scikit-learn is a powerful technique for natural language processing tasks such as sentiment analysis, topic modeling, and text classification. Vectorization converts text data into numerical features that machine learning algorithms can understand and analyze. It lets you process and analyze large amounts of text data efficiently, which makes it an essential step in any NLP project. In this thread, we will learn how to vectorize two text columns in various ways using a ColumnTransformer.
1. "CountVectorizer":
CountVectorizer is a simple method for converting text into a bag-of-words representation. It counts the number of occurrences of each word in a text and creates a sparse matrix with one row per document and one column per unique word in the corpus.
Let’s see the example given below to gain a better understanding:
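A minimal sketch of such an example follows. The sample sentences and the transformer and step names (text1_count, text2_count, vectorize) are illustrative assumptions, not values taken from a reference implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical sample data: two text columns with two sentences each.
df = pd.DataFrame({
    "text1": ["This is the first sentence.", "This is the second sentence."],
    "text2": ["Here is another sentence.", "And one more sentence."],
})

# One CountVectorizer per text column. Each column is selected with a plain
# string (not a list) so the vectorizer receives a 1-D sequence of documents.
ct = ColumnTransformer([
    ("text1_count", CountVectorizer(), "text1"),
    ("text2_count", CountVectorizer(), "text2"),
])

# Wrap the ColumnTransformer in a Pipeline so that later steps
# (for example a classifier) could be appended.
pipeline = Pipeline([("vectorize", ct)])

# Learn both vocabularies and produce the combined count matrix.
X = pipeline.fit_transform(df)
print(X.shape)  # (2, 13): 6 distinct words in text1 + 7 in text2
```

On a dataset this small the combined result may come back as a dense array rather than a sparse matrix, because ColumnTransformer only keeps sparse output when the overall density is below its sparse_threshold parameter (0.3 by default).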
- The above code imports the required libraries: CountVectorizer, ColumnTransformer, Pipeline, and pandas.
- It then creates a small dataset with two columns, ‘text1’ and ‘text2’, each containing two sentences.
- It applies the ColumnTransformer to the dataset. ColumnTransformer is a way to apply different transformers to different columns of a dataset. In this case, it applies the CountVectorizer transformer to each column of the dataset.
- It then creates a Pipeline object that includes the ColumnTransformer object created in the previous step.
- Fitting and transforming the data: Finally, the code fits the pipeline to the dataset and transforms the data. Fitting a pipeline means training the transformer(s) on the data. Transforming the data means applying the transformer(s) to the data and converting it into numerical features.
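The difference between fitting and transforming is easiest to see when the two steps run on different data. The sketch below fits on one hypothetical DataFrame and transforms another; the sentences and the step names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical training data and new data with the same two text columns.
train = pd.DataFrame({
    "text1": ["This is the first sentence.", "This is the second sentence."],
    "text2": ["Here is another sentence.", "And one more sentence."],
})
new = pd.DataFrame({
    "text1": ["This is a brand new sentence."],
    "text2": ["Another unseen example."],
})

pipeline = Pipeline([
    ("vectorize", ColumnTransformer([
        ("text1_count", CountVectorizer(), "text1"),
        ("text2_count", CountVectorizer(), "text2"),
    ])),
])

# fit: learn one vocabulary per column from the training data only.
pipeline.fit(train)

# transform: count words in the new rows using the learned vocabularies.
# Words that were not seen during fit are ignored.
X_new = pipeline.transform(new)
X_new = X_new.toarray() if hasattr(X_new, "toarray") else X_new

print(X_new.shape)       # (1, 13): same columns as the training matrix
print(int(X_new.sum()))  # 4: "this", "is", "sentence" and "another" are known
```

The number of columns is fixed at fit time, so new rows always map onto the same feature layout.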
2. "TfidfVectorizer":
TfidfVectorizer is similar to CountVectorizer, but it also takes into account how frequently a word appears across the corpus, so that words occurring in many documents carry less weight. It computes the TF-IDF (term frequency-inverse document frequency) weight of each word in the text and represents each document as a vector of these weights.
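To make the down-weighting concrete, the short sketch below fits a TfidfVectorizer on a hypothetical two-sentence corpus and compares the learned inverse-document-frequency weights of a word that appears in every document with one that appears in only one. The sentences are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "sentence" occurs in both documents,
# "first" and "second" occur in only one each.
docs = ["This is the first sentence.", "This is the second sentence."]

vec = TfidfVectorizer()
vec.fit(docs)

# idf_ holds one inverse-document-frequency weight per vocabulary term,
# indexed by the column numbers stored in vocabulary_.
idf_common = vec.idf_[vec.vocabulary_["sentence"]]
idf_rare = vec.idf_[vec.vocabulary_["first"]]

print(idf_common < idf_rare)  # True: the rarer word gets the larger weight
```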
Let’s see the example given below to gain a better understanding:
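A minimal sketch of such an example follows. As before, the sample sentences and the names text1_tfidf, text2_tfidf and vectorize are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical sample data: two text columns with two sentences each.
df = pd.DataFrame({
    "text1": ["This is the first sentence.", "This is the second sentence."],
    "text2": ["Here is another sentence.", "And one more sentence."],
})

# One TfidfVectorizer per text column, each with its own vocabulary
# and its own document-frequency statistics.
ct = ColumnTransformer([
    ("text1_tfidf", TfidfVectorizer(), "text1"),
    ("text2_tfidf", TfidfVectorizer(), "text2"),
])

pipeline = Pipeline([("vectorize", ct)])

# Fit both vectorizers and produce the combined TF-IDF matrix.
X = pipeline.fit_transform(df)
print(X.shape)  # (2, 13): 6 distinct words in text1 + 7 in text2
```

Because each column has its own vectorizer, a word such as "sentence" is weighted separately within text1 and within text2.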
- The above code imports the required libraries: TfidfVectorizer, ColumnTransformer, Pipeline, and pandas.
- It then creates a small dataset with two columns, ‘text1’ and ‘text2’, each containing two sentences.
- It applies the ColumnTransformer to the dataset. ColumnTransformer is a way to apply different transformers to different columns of a dataset. In this case, it applies the TfidfVectorizer transformer to each column of the dataset.
- It creates a Pipeline object that includes the ColumnTransformer object created in the previous step.
- Finally, it fits the pipeline to the dataset and transforms the data. Fitting a pipeline means training the transformer(s) on the data. Transforming the data means applying the transformer(s) to the data and converting it into numerical features.
3. "HashingVectorizer":
HashingVectorizer is a method for vectorizing text that uses a hash function to map words to columns in a fixed-size matrix. This avoids the need to store the vocabulary in memory, but it can lead to collisions where different words are mapped to the same column.
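The fixed width and the possibility of collisions can be seen with a deliberately tiny feature space. In the sketch below, n_features=4 and the sample sentence are assumptions chosen only to force collisions; realistic settings use a far larger value.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# A deliberately tiny feature space (assumption for illustration only)
# so that collisions are unavoidable.
vec = HashingVectorizer(n_features=4)

# Six distinct words must share four columns, so at least two of them
# are hashed to the same column.
X = vec.fit_transform(["apple banana cherry date elderberry fig"])

print(X.shape)  # (1, 4): width is fixed by n_features, not by the vocabulary
```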
Let’s see the example given below to gain a better understanding:
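A minimal sketch of such an example follows. The names df, ct, pipeline, text1_hash and text2_hash follow the walkthrough; the sample sentences, the step name vectorize and the explicit n_features=2**10 are assumptions made to keep the output small.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical sample data: two text columns with two rows each.
df = pd.DataFrame({
    "text1": ["This is the first sentence.", "This is the second sentence."],
    "text2": ["Here is another sentence.", "And one more sentence."],
})

# One HashingVectorizer per text column. n_features fixes the number of
# output columns per vectorizer (1024 here is an assumed, illustrative value).
ct = ColumnTransformer([
    ("text1_hash", HashingVectorizer(n_features=2**10), "text1"),
    ("text2_hash", HashingVectorizer(n_features=2**10), "text2"),
])

pipeline = Pipeline([("vectorize", ct)])

# HashingVectorizer learns nothing during fit, so fit_transform
# simply hashes the words of each row into columns.
X = pipeline.fit_transform(df)
print(X.shape)  # (2, 2048): 1024 hashed columns per text column
```

Unlike the two previous vectorizers, the output width here does not depend on the words in the data, only on n_features.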
In the above code,
- First, we import the required libraries for building the pipeline: HashingVectorizer, ColumnTransformer, Pipeline, and pandas.
- Next, we create a DataFrame df with two columns (‘text1’ and ‘text2’) and two rows containing some sample text data.
- Then, we define a ColumnTransformer object ct that applies HashingVectorizer to each text column in the df DataFrame. HashingVectorizer is used to convert text data into numerical features that can be fed into a machine learning model. The ColumnTransformer takes a list of tuples, each tuple containing a unique name for the transformation (‘text1_hash’, ‘text2_hash’), the transformer object (HashingVectorizer), and the name of the column in the DataFrame to be transformed (‘text1’, ‘text2’).
- Next, we create a Pipeline object pipeline that applies the ColumnTransformer ct to the data. A pipeline allows us to chain multiple transformations and apply them in a sequence.
- Finally, we fit and transform the df DataFrame using the pipeline object, which applies the HashingVectorizer transformation to each text column in the DataFrame.
- The output of pipeline.fit_transform(df) is a sparse matrix where each row corresponds to a row in the input DataFrame df, and each column corresponds to a hashed feature generated from the text data.