Filtering Valid Emails from a Series

This thread discusses several methods for filtering valid emails from a series. Filtering matters because manually separating invalid emails from valid ones in a large dataset is time-consuming, and leftover invalid emails can lead to inaccurate results when analysts intend to work with valid emails only. If you want to learn how to filter values and words based on certain criteria, have a look at the following threads:

  1. Filtering out values from a series.
  2. Filtering words from a series.

To identify valid email addresses in a dataset, a practical approach is to use a regular expression pattern that matches the standard email format. The same pattern is used across all the methods discussed below, so you’ll learn different ways of applying one pattern to your data.
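
The thread does not show the pattern itself, so the following is only a minimal sketch of what such a pattern might look like. The names EMAIL_PATTERN and is_valid_email are hypothetical, and the pattern is a simplified assumption that covers common addresses rather than the full email specification.

```python
import re

# Hypothetical, simplified pattern (an assumption, not taken from the thread):
# local part, "@", domain, a dot, and a top-level domain of 2+ letters.
EMAIL_PATTERN = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'


def is_valid_email(text):
    """Return True if text is a string that matches EMAIL_PATTERN."""
    return isinstance(text, str) and re.match(EMAIL_PATTERN, text) is not None


print(is_valid_email('alice@example.com'))  # True
print(is_valid_email('carol@site'))         # False (no top-level domain)
```

The ^ and $ anchors make the pattern cover the whole string, which matters for methods that would otherwise accept a match anywhere inside it.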

1. Using "str.match()" method:

  • str.match() is a Pandas method that checks whether each string element in a Pandas series matches a regular expression pattern, anchored at the start of the string. It returns a Boolean series indicating which elements match.
  • The argument na=False makes the method return False for missing or null values instead of NaN, so the result can be used directly as a Boolean mask.
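
A minimal sketch of this method, assuming a small sample series and a simplified email pattern (both the data and the pattern are illustrative assumptions, not taken from the thread):

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
emails = pd.Series(['alice@example.com', 'bob.smith@mail.org', 'not-an-email',
                    'carol@site', None, 'dave@company.co.uk'])

# Simplified pattern (an assumption); it does not cover every valid address.
pattern = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'

# Boolean mask: True where the element matches, False for missing values.
mask = emails.str.match(pattern, na=False)

# Keep only the rows where the mask is True.
valid = emails[mask]
print(valid.tolist())
# ['alice@example.com', 'bob.smith@mail.org', 'dave@company.co.uk']
```

Note that the filtered series keeps the original index labels (0, 1, and 5 here); call reset_index(drop=True) if a fresh index is preferred.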

2. Using "re.match()" and "apply()":

  • The re.match() function looks for a pattern at the beginning of a string and returns a match object if the pattern is found; otherwise, it returns None.
  • re.match() is wrapped in a simple lambda function, which is then applied to the series with the apply() method. apply() calls the function on each element of a Pandas series.
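
This approach can be sketched as follows. The sample series and the simplified pattern are illustrative assumptions. The isinstance() check is added here because re.match() raises a TypeError on non-string values such as None or NaN.

```python
import re

import pandas as pd

# Hypothetical sample data, including a missing value.
emails = pd.Series(['alice@example.com', 'bob.smith@mail.org', 'not-an-email',
                    'carol@site', None, 'dave@company.co.uk'])

# Simplified pattern (an assumption); it does not cover every valid address.
pattern = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'

# Convert the match object (or None) to a boolean; skip non-string values.
mask = emails.apply(
    lambda x: isinstance(x, str) and re.match(pattern, x) is not None
)

valid = emails[mask]
print(valid.tolist())
# ['alice@example.com', 'bob.smith@mail.org', 'dave@company.co.uk']
```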

3. Using "str.findall()" method:

  • The str.findall() method returns all non-overlapping matches of a regular expression pattern in each string of a series, as a list per element. We can use it to extract the email addresses in a Pandas series and then filter out the invalid ones.
  • The result contains an empty list wherever an email is invalid, so a lambda function is applied with the apply() method to keep only the elements with a non-empty result.
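
One way this could look, assuming a sample series and a simplified pattern that are not from the thread. Missing values come back as NaN rather than a list, so the lambda checks the type before checking the length.

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
emails = pd.Series(['alice@example.com', 'bob.smith@mail.org', 'not-an-email',
                    'carol@site', None, 'dave@company.co.uk'])

# Simplified pattern (an assumption); it does not cover every valid address.
pattern = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'

# Each element becomes a list of matches: one item if valid, empty if not.
found = emails.str.findall(pattern)

# True only where findall produced a non-empty list.
mask = found.apply(lambda m: isinstance(m, list) and len(m) > 0)

valid = emails[mask]
print(valid.tolist())
# ['alice@example.com', 'bob.smith@mail.org', 'dave@company.co.uk']
```

Because the pattern in this sketch is anchored with ^ and $, each list holds at most one item, namely the whole address.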

4. Using list comprehension:

  • In this method, re.match() is applied to each email individually, and the email is kept if it matches the pattern.
  • Since a list comprehension is used, the result is a list. To convert it into a series, simply pass the list to the pd.Series() constructor.
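
A short sketch of the list comprehension, again with an assumed sample series and a simplified pattern. The isinstance() check is an addition to skip missing values safely.

```python
import re

import pandas as pd

# Hypothetical sample data, including a missing value.
emails = pd.Series(['alice@example.com', 'bob.smith@mail.org', 'not-an-email',
                    'carol@site', None, 'dave@company.co.uk'])

# Simplified pattern (an assumption); it does not cover every valid address.
pattern = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'

# Keep each email that is a string and matches the pattern.
valid_list = [e for e in emails if isinstance(e, str) and re.match(pattern, e)]

# Convert the list back into a series (this creates a fresh 0-based index).
valid = pd.Series(valid_list)
print(valid.tolist())
# ['alice@example.com', 'bob.smith@mail.org', 'dave@company.co.uk']
```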

5. Using "str.contains()" method:

  • The str.contains() method is used in this example with the regex=True parameter to check whether each element in the series contains a match for the regular expression pattern. It returns a Boolean result.
  • The na=False parameter ensures that missing values are not considered valid email addresses.
  • The resulting boolean mask is then used to filter out the invalid emails from the original series.
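
A minimal sketch under the same assumptions as before (illustrative sample data and a simplified pattern). Since str.contains() searches anywhere in the string, the pattern in this sketch is anchored with ^ and $ so that only whole-string matches count.

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
emails = pd.Series(['alice@example.com', 'bob.smith@mail.org', 'not-an-email',
                    'carol@site', None, 'dave@company.co.uk'])

# Simplified, anchored pattern (an assumption, not taken from the thread).
pattern = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'

# Boolean mask; na=False treats missing values as non-matching.
mask = emails.str.contains(pattern, regex=True, na=False)

valid = emails[mask]
print(valid.tolist())
# ['alice@example.com', 'bob.smith@mail.org', 'dave@company.co.uk']
```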

6. Using "filter()" method:

  • We can also use Python's built-in filter() function along with re.match() to keep only the valid email addresses from a Pandas series.
  • filter() returns an iterator, so wrap it in list() first; to convert the result into a series, simply pass that list to the pd.Series() constructor.
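
This last approach might look like the sketch below. The sample series and the simplified pattern are assumptions for illustration, and the isinstance() check is added so that missing values are skipped instead of raising an error.

```python
import re

import pandas as pd

# Hypothetical sample data, including a missing value.
emails = pd.Series(['alice@example.com', 'bob.smith@mail.org', 'not-an-email',
                    'carol@site', None, 'dave@company.co.uk'])

# Simplified pattern (an assumption); it does not cover every valid address.
pattern = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'

# filter() keeps the elements for which the function returns a truthy value.
valid_list = list(
    filter(lambda e: isinstance(e, str) and re.match(pattern, e), emails)
)

# Convert the list back into a series (this creates a fresh 0-based index).
valid = pd.Series(valid_list)
print(valid.tolist())
# ['alice@example.com', 'bob.smith@mail.org', 'dave@company.co.uk']
```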