Summarise data by the label in R

toobamukhtar · February 25, 2019, 6:51pm

I have mostly worked with Python all my life and use pandas extensively for data manipulation. Take this dataframe for example:

Name	Income (in thousands)	Gender
John	19	Male
Jane	21	Female
Jonas	18	Male
Jules	25	Female

To see the summary statistics based on class, I can simply use a pandas function.
But how do you do it in R? I have seen a few solutions on the internet but none of them are quite readable or maintainable. Please provide code examples.

Rabeez · February 27, 2019, 4:57am

I’ll use the Iris dataset as an example.

Pandas code to get an aggregate summary:

In [6]: iris.groupby('species').mean()
Out[6]:
            sepal_length  sepal_width  petal_length  petal_width
species
setosa             5.006        3.428         1.462        0.246
versicolor         5.936        2.770         4.260        1.326
virginica          6.588        2.974         5.552        2.026

You can do something similar in R using the dplyr package:

library(dplyr)

iris %>% 
    group_by(Species) %>% 
    summarize(avg.sepal.length = mean(Sepal.Length))

  Species    avg.sepal.length
  <fct>                 <dbl>
1 setosa                 5.01
2 versicolor             5.94
3 virginica              6.59

To get the exact result pandas gave you by default, you would have to pass four separate expressions (one per numeric column) as arguments to summarize. Fortunately, there is a separate function, summarize_all, for when you want to calculate a statistic (or even several) over every non-grouping column of the dataframe:

iris %>% 
    group_by(Species) %>% 
    summarize_all(mean)

  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03
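
To illustrate the "or even several" case, here is a minimal sketch that passes a named list of functions to summarize_all. It also shows the same grouped mean written with across(), which newer dplyr releases (1.0.0 and later) recommend over the summarize_all family. The variable names multi_stats and by_species are mine, and the sketch assumes a reasonably recent dplyr is installed.

```r
library(dplyr)

# Several statistics at once: each output column is named
# <column>_<function>, e.g. Sepal.Length_mean and Sepal.Length_sd.
multi_stats <- iris %>%
    group_by(Species) %>%
    summarize_all(list(mean = mean, sd = sd))

# The grouped mean from above, written with across() (dplyr >= 1.0.0).
# everything() selects all columns except the grouping column.
by_species <- iris %>%
    group_by(Species) %>%
    summarize(across(everything(), mean))

print(multi_stats)
print(by_species)
```

If you are on an older dplyr, only the summarize_all call will work; across() is the form the current documentation points to.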

The %>% is called the pipe operator. It plays a role similar to method chaining with the . (dot) operator in pandas: the pipe passes the value on its left as the first argument to the function on its right.
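
A small sketch of what the pipe does, in case it helps: the piped and nested forms below should give the same result. The variable names (total, nested, piped) are just for illustration.

```r
library(dplyr)

# x %>% f(y) is equivalent to f(x, y).
total <- c(1, 4, 9) %>% sqrt() %>% sum()   # same as sum(sqrt(c(1, 4, 9)))
print(total)                               # 6

# The grouped summary from earlier, written without and with the pipe.
nested <- summarize(group_by(iris, Species),
                    avg.sepal.length = mean(Sepal.Length))

piped <- iris %>%
    group_by(Species) %>%
    summarize(avg.sepal.length = mean(Sepal.Length))

# Both forms produce the same table.
print(all.equal(as.data.frame(nested), as.data.frame(piped)))
```

The nested form has to be read inside-out, which is the main reason the piped form tends to be easier to maintain.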

Also, RStudio publishes fantastic cheat sheets for many R data science libraries, which you can use as a quick reference for the functions these libraries provide.
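
Applied to the dataframe from the question, the same pattern might look like the sketch below. I have shortened the column name "Income (in thousands)" to Income for convenience, and the names people and income_by_gender are my own.

```r
library(dplyr)

# The example data from the question (Income is in thousands).
people <- data.frame(
    Name   = c("John", "Jane", "Jonas", "Jules"),
    Income = c(19, 21, 18, 25),
    Gender = c("Male", "Female", "Male", "Female")
)

# One row per gender, with a few summary statistics of income.
income_by_gender <- people %>%
    group_by(Gender) %>%
    summarize(
        mean_income = mean(Income),
        min_income  = min(Income),
        max_income  = max(Income),
        n           = n()
    )

print(income_by_gender)
```

For this data the mean income comes out as 23 for Female and 18.5 for Male, with two people in each group.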