What is the concept of stratified sampling, and how can I implement code for its application? Could you provide guidance on effectively incorporating this logic?
Stratified sampling is a statistical sampling technique used to ensure that the representation of different groups or classes in a population is maintained in the sample. To understand the concept practically let’s look into this example:
The code utilizes train_test_split
from sklearn.model_selection
to split the iris dataset into training and testing sets. Setting test_size
to 0.3 allocates 30% for testing and 70% for training. With stratify=y
, the split maintains class proportions. The function returns X_train
, X_test
, y_train
, and y_test
. Finally, the code prints the sample counts in each set using len
applied to X_train
and X_test
.