I’ll use the Iris dataset as an example.
Pandas code to produce an aggregate summary:
In [6]: iris.groupby('species').mean()
Out[6]:
            sepal_length  sepal_width  petal_length  petal_width
species
setosa             5.006        3.428         1.462        0.246
versicolor         5.936        2.770         4.260        1.326
virginica          6.588        2.974         5.552        2.026
You can do something similar in R using the dplyr package:
iris %>%
  group_by(Species) %>%
  summarize(avg.sepal.length = mean(Sepal.Length))
  Species    avg.sepal.length
  <fct>                 <dbl>
1 setosa                 5.01
2 versicolor             5.94
3 virginica              6.59
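For comparison, pandas can also produce a single, explicitly named summary column like the one above. This is only a sketch: the small frame below is a made-up stand-in (not the real Iris measurements) so the snippet runs on its own, and the output column name avg_sepal_length is my own choice.

```python
import pandas as pd

# Toy stand-in for the Iris data: made-up values, not the real measurements.
iris = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.0, 5.2, 6.0, 6.4],
    "sepal_width": [3.4, 3.6, 2.8, 3.0],
})

# Named aggregation: new_column_name=(input_column, aggregation_function).
summary = iris.groupby("species").agg(avg_sepal_length=("sepal_length", "mean"))
print(summary)
```

As with summarize in dplyr, you name each output column yourself and state which input column and function it comes from.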
To get the exact result pandas gave you by default, you would have to pass four separate expressions (as arguments) to summarize, one per measurement column. Fortunately, there is a separate function for when you want to calculate a statistic (or even several) over every non-grouping column of the dataframe:
iris %>%
  group_by(Species) %>%
  summarize_all(mean)
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33
3 virginica          6.59        2.97         5.55       2.03
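The "or even several" case has a pandas counterpart too. Here is a minimal sketch, again on a made-up toy frame rather than the real Iris data, that computes one statistic and then two statistics for every remaining column per group.

```python
import pandas as pd

# Toy stand-in for the Iris data: made-up values, not the real measurements.
iris = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.0, 5.2, 6.0, 6.4],
    "sepal_width": [3.4, 3.6, 2.8, 3.0],
})

# One statistic over every non-grouping column.
means = iris.groupby("species").mean()

# Several statistics at once: the result gets a two-level column index,
# with the original column on top and the statistic underneath.
stats = iris.groupby("species").agg(["mean", "max"])
print(stats)
```

You can pick out a single cell of the two-level result with a tuple, for example stats.loc["setosa", ("sepal_length", "max")].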
The %>% is called the pipe operator and works much like the . (dot) operator does when you chain methods in pandas. The pipe passes the thing on its left as the first argument to the function on its right.
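To make the analogy concrete from the pandas side, here is a rough sketch of both styles: ordinary dot chaining, and DataFrame.pipe, which hands the frame to a function as its first argument. The frame is a made-up toy stand-in and top_rows is a hypothetical helper I wrote just for this illustration.

```python
import pandas as pd

# Toy stand-in for the Iris data: made-up values, not the real measurements.
iris = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.0, 5.2, 6.0, 6.4],
    "sepal_width": [3.4, 3.6, 2.8, 3.0],
})


def top_rows(df, n):
    """Hypothetical helper: return the first n rows of df."""
    return df.head(n)


# Dot chaining: each method is called on the result of the previous one.
chained = iris.groupby("species").mean().round(1)

# DataFrame.pipe passes the frame as the first argument of the function,
# which is the same idea as `iris %>% top_rows(2)` would be in R.
piped = iris.pipe(top_rows, 2)

# The equivalent plain function call.
direct = top_rows(iris, 2)
print(piped.equals(direct))
```

The dot only works for methods that already exist on the object, whereas pipe (like %>%) lets you slot any function that takes the data as its first argument into the chain.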
Also, RStudio has fantastic cheat sheets for many R data science libraries, which you can use as a reference for the kinds of functions these libraries provide.