This thread discusses several methods for filtering valid emails from a Pandas Series. Filtering matters because manually separating invalid emails from valid ones in a large dataset is time-consuming, and any invalid emails left in the data can lead to inaccurate results when analysts intend to work with valid emails only.
To identify valid email addresses in a dataset, a practical approach is to use a regular expression pattern that matches the standard email format. The same pattern is used across all the methods discussed below, so you’ll see different ways of applying one pattern to your data.
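As a rough illustration of such a pattern, here is a minimal sketch. The pattern itself, the name `EMAIL_PATTERN`, the helper `looks_like_email`, and the sample addresses are all assumptions made for this example. The pattern is a simplified one and not a complete validator of every legal email address.

```python
import re

# Hypothetical, simplified pattern: local part, "@", domain, dot, 2+ letter suffix.
# It is anchored with ^ and $ so the whole string has to look like an email.
EMAIL_PATTERN = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

def looks_like_email(value):
    """Return True if value is a string that matches the simplified pattern."""
    return isinstance(value, str) and re.match(EMAIL_PATTERN, value) is not None

# Made-up sample values for illustration
samples = ["alice@example.com", "invalid.email", "carol@site", "dave@domain.co.uk"]
results = {s: looks_like_email(s) for s in samples}
```

The `isinstance` check is a design choice so that missing values such as `None` are treated as invalid instead of raising an error.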
1. Using "str.match()" method:
- `str.match()` is a Pandas method that checks whether each string in a Series matches a regular expression pattern, starting from the beginning of the string. It returns a Boolean Series with one value per element.
- The argument `na=False` makes the method return `False` for missing or null values instead of leaving them as missing.
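A minimal sketch of this method, assuming a simplified pattern and made-up sample data (neither is taken from the original thread):

```python
import pandas as pd

# Hypothetical, simplified email pattern used only for illustration
pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

# Made-up sample Series that includes one missing value
emails = pd.Series(
    ["alice@example.com", "invalid.email", None, "carol@site", "dave@domain.co.uk"]
)

# Boolean mask: True where the element matches; na=False maps missing values to False
mask = emails.str.match(pattern, na=False)

# Keep only the rows where the mask is True
valid_emails = emails[mask]
```

Boolean indexing keeps the original index labels, so `valid_emails` here retains labels 0 and 4.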
2. Using "re.match()" and "apply()":
- `re.match()` looks for a pattern at the beginning of a string and returns a match object if the pattern is found; otherwise it returns `None`.
- Here `re.match()` is wrapped in a simple lambda function, which is then applied to the Series with `apply()`, a method that calls a function on each element of a Pandas Series.
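A minimal sketch of this approach, again assuming a simplified pattern and made-up sample data. The `isinstance` guard is an addition for this example: `re.match()` raises a `TypeError` when it is given a non-string such as a missing value.

```python
import re
import pandas as pd

# Hypothetical, simplified email pattern used only for illustration
pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

# Made-up sample Series that includes one missing value
emails = pd.Series(
    ["alice@example.com", "invalid.email", None, "carol@site", "dave@domain.co.uk"]
)

# apply() calls the lambda once per element; non-strings are treated as invalid
mask = emails.apply(lambda x: isinstance(x, str) and re.match(pattern, x) is not None)

# Use the Boolean result to keep only the valid addresses
valid_emails = emails[mask]
```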
3. Using "str.findall()" method:
- `str.findall()` returns all non-overlapping matches of a regular expression pattern in each string. We can use it to extract the email addresses in a Pandas Series and filter out the invalid ones.
- The result contains an empty list wherever no valid email was found. To keep only the valid values, a lambda function is applied to the Series with `apply()`.
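A minimal sketch of this method, assuming a simplified pattern and made-up sample data. Dropping missing values first is a choice made for this example so that every result is a list whose length can be checked.

```python
import pandas as pd

# Hypothetical, simplified email pattern used only for illustration
pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

# Made-up sample Series that includes one missing value
emails = pd.Series(
    ["alice@example.com", "invalid.email", None, "carol@site", "dave@domain.co.uk"]
)

# Remove missing values so that findall() yields a list for every element
clean = emails.dropna()

# Each element becomes a list of matches; an empty list means no valid email
found = clean.str.findall(pattern)

# Keep rows whose list of matches is not empty
has_match = found.apply(lambda matches: len(matches) > 0)
valid_emails = clean[has_match]
```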
4. Using list comprehension:
- In this method, `re.match()` is applied to each email individually, and the email is kept if it matches the pattern.
- Because a list comprehension is used, the result is a list. To convert it into a Series, pass the list to the `pd.Series()` constructor.
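A minimal sketch of the list comprehension, assuming a simplified pattern and made-up sample data. The `isinstance` guard is added here so that missing values are skipped instead of raising an error.

```python
import re
import pandas as pd

# Hypothetical, simplified email pattern used only for illustration
pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

# Made-up sample Series that includes one missing value
emails = pd.Series(
    ["alice@example.com", "invalid.email", None, "carol@site", "dave@domain.co.uk"]
)

# Keep each element that is a string and matches the pattern
valid_list = [e for e in emails if isinstance(e, str) and re.match(pattern, e)]

# Convert the plain list back into a Series (this creates a fresh 0..n-1 index)
valid_series = pd.Series(valid_list)
```

Unlike Boolean indexing, this route does not preserve the original index labels.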
5. Using "str.contains()" method:
- `str.contains()` is used in this example with the `regex=True` parameter to check whether each element in the Series contains a match for the regular expression pattern; it returns a Boolean result. Because it searches anywhere in the string, the pattern needs `^` and `$` anchors to validate the whole value.
- The `na=False` parameter ensures that missing values are not considered valid email addresses.
- The resulting Boolean mask is then used to filter out the invalid emails from the original Series.
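A minimal sketch of this method, assuming a simplified, fully anchored pattern and made-up sample data:

```python
import pandas as pd

# Hypothetical, simplified email pattern; ^ and $ make it cover the whole string
pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

# Made-up sample Series that includes one missing value
emails = pd.Series(
    ["alice@example.com", "invalid.email", None, "carol@site", "dave@domain.co.uk"]
)

# regex=True treats the pattern as a regular expression; na=False maps missing to False
mask = emails.str.contains(pattern, regex=True, na=False)

# Filter the original Series with the Boolean mask
valid_emails = emails[mask]
```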
6. Using "filter()" method:
- We can also use Python's built-in `filter()` function together with `re.match()` to keep only the valid email addresses from a Pandas Series.
- In Python 3, `filter()` returns an iterator. Once it has been converted to a list with `list()`, pass the list to the `pd.Series()` constructor to get a Series.
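A minimal sketch of this approach, assuming a simplified pattern and made-up sample data. The `isinstance` guard is added for this example so that missing values are dropped instead of causing an error in `re.match()`.

```python
import re
import pandas as pd

# Hypothetical, simplified email pattern used only for illustration
pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

# Made-up sample Series that includes one missing value
emails = pd.Series(
    ["alice@example.com", "invalid.email", None, "carol@site", "dave@domain.co.uk"]
)

# filter() yields the elements for which the lambda is truthy; list() collects them
valid_list = list(
    filter(lambda e: isinstance(e, str) and re.match(pattern, e), emails)
)

# Convert the list into a Series with a fresh index
valid_series = pd.Series(valid_list)
```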