R and Python are two of the main programming languages used in data science. Sometimes it may be necessary to move from one language to the other, for example because Python is the more general-purpose programming language while R is more specialized in data analysis.
At PyConDE 2022 I gave a talk titled Rewriting your R analysis code in Python.
This is part 2 of a series of blog posts covering the topics from that talk.
Part 2 is all about pipes.
Summary
- Part 1: Basic differences
- Part 2: Pipes
- Part 3: Classes
- Part 4: Library Loading
- Part 5: … vs. *args, **kwargs
- Part 6: Non-standard evaluation
- Part 7: Run R code from python and vice versa
Pipes
A pipe is something where you put something in at one end and it comes
out at the other end, with some processing happening along the way.
Take a water pipe, for example: cold water goes in. It passes through a
heater and becomes warm. Afterwards it goes to the dishwasher,
and you get clean dishes, but the water gets dirty.
The same happens when you feed data into a data processing pipe.
In R, the pipe symbol %>% from the magrittr package, combined with the helper functions from the dplyr package, is a well-known tool for building data processing pipes. Since R version 4.1 there is also the native pipe |>, which builds on the | symbol that has been used for pipes in Unix shells for a long time.
The example uses the Palmer Penguins example data, groups it by species, and computes a simple summary per species:
R: magrittr-Pipe
library(dplyr)
library(palmerpenguins)

penguins %>%
  group_by(species) %>%
  summarise(
    n = n(),
    bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    body_mass = mean(body_mass_g, na.rm = TRUE)
  )
## # A tibble: 3 × 4
##   species       n bill_depth body_mass
##   <fct>     <int>      <dbl>     <dbl>
## 1 Adelie      152       18.3     3701.
## 2 Chinstrap    68       18.4     3733.
## 3 Gentoo      124       15.0     5076.
R: Native Pipe
Same example using the native pipe:
penguins |>
  group_by(species) |>
  summarise(
    n = n(),
    bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    body_mass = mean(body_mass_g, na.rm = TRUE)
  )
## # A tibble: 3 × 4
##   species       n bill_depth body_mass
##   <fct>     <int>      <dbl>     <dbl>
## 1 Adelie      152       18.3     3701.
## 2 Chinstrap    68       18.4     3733.
## 3 Gentoo      124       15.0     5076.
Python
Python does not have the same piping mechanism. But it is object oriented,
and you can call functions that belong to an object (its methods) by accessing them with the
. operator. And with \ you can tell Python that your statement continues on the next line. So you can write something that looks similar to R pipes in Python, using a pandas data frame and .\ as a pipe-like operator:
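!pip3 install palmerpenguins

from palmerpenguins import load_penguins

# Load the example data as a pandas data frame
penguins = load_penguins()

# Group by species, then count and average bill depth and average body mass
penguins.\
    groupby('species').\
    aggregate({
        'bill_depth_mm': ['count', 'mean'],
        'body_mass_g': ['mean']
    })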
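##           bill_depth_mm            body_mass_g
##                   count       mean          mean
## species
## Adelie              151  18.346358 -1.412451e+07
## Chinstrap            68  18.420588  3.733088e+03
## Gentoo              123  14.982114 -1.731338e+07
Note that the negative body mass means for Adelie and Gentoo in this output are not plausible. The R results above (about 3701 and 5076) are the values to expect, so the missing values were probably not handled correctly in this run.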
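As an alternative to the trailing backslash, a method chain can also be wrapped in parentheses, which lets each step start on its own line. The following is only a small sketch of that style; it uses a plain string with made-up content instead of the penguins data so that it runs without any extra packages.

```python
# Made-up example value, only used to illustrate the chaining style
raw = "  Adelie;Chinstrap;Gentoo  "

# Inside parentheses Python allows line breaks, so no backslash is needed
species = (
    raw
    .strip()      # remove surrounding whitespace
    .lower()      # convert to lower case
    .split(";")   # split into a list of names
)

print(species)
```

The same layout works for a pandas chain such as the groupby and aggregate calls above.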
The main advantage of the R pipe is that you can simply combine functions from different packages, as well as your own, to create a data processing pipe.
In Python, method chaining only gives you the methods that the object already provides. Adding more methods to those result objects requires much more work with classes and inheritance.
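To give an idea of what such extra work can look like, here is a minimal sketch of a wrapper class that lets arbitrary functions take part in a method chain. The class name Chain, the helper functions, and the numbers are hypothetical and only serve as illustration. pandas data frames offer a pipe method that plays a similar role for your own functions.

```python
class Chain:
    """Hypothetical wrapper that makes any value chainable via .pipe()."""

    def __init__(self, value):
        self.value = value

    def pipe(self, func, *args, **kwargs):
        # Call func with the wrapped value as first argument
        # and wrap the result again so the chain can continue
        return Chain(func(self.value, *args, **kwargs))


def drop_missing(values):
    # Remove missing entries, comparable to na.rm = TRUE in R
    return [v for v in values if v is not None]


def mean(values):
    return sum(values) / len(values)


# Made-up illustration values, not taken from the penguins data
body_mass = [3750, 3800, None, 3450]

result = (
    Chain(body_mass)
    .pipe(drop_missing)
    .pipe(mean)
    .pipe(round, 1)
    .value
)
print(result)
```

Every step receives the result of the previous one as its first argument, which is roughly what %>% does in R.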
Python pipe Package
There is a Python package called pipe that allows piping data using the | sign,
but it does not work on pandas data frames.
It can, however, be used to define functions that do things similar to what is known from Unix command lines, like grep:
from pipe import Pipe
import re

@Pipe
def grep(iterable, pattern, flags=0):
    # Yield only the lines that match the pattern at their beginning
    for line in iterable:
        if re.match(pattern, line, flags=flags):
            yield line

lines = ["Hello", "hello", "World", "world"]
for line in lines | grep("H"):
    print(line)
## Hello
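How the | syntax can work on a plain list may be sketched with Python's operator overloading. The class SimplePipe below is a simplified stand-in written for this illustration, not the actual code of the pipe package, and the functions starts_with and upper are hypothetical examples.

```python
class SimplePipe:
    """Simplified stand-in for a pipe decorator (illustration only)."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __call__(self, *args, **kwargs):
        # starts_with("H") stores the extra arguments in a new pipe object
        return SimplePipe(self.func, *args, **kwargs)

    def __ror__(self, left):
        # For `left | self`, Python falls back to self.__ror__(left)
        # when `left` (e.g. a list) does not support `|` itself
        return self.func(left, *self.args, **self.kwargs)


@SimplePipe
def starts_with(iterable, prefix):
    for item in iterable:
        if item.startswith(prefix):
            yield item


@SimplePipe
def upper(iterable):
    for item in iterable:
        yield item.upper()


words = ["Hello", "hello", "World", "world"]
result = list(words | starts_with("H") | upper)
print(result)
```

The data on the left-hand side is always passed as the first argument of the function on the right-hand side, so the steps have to be written as functions over iterables.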
I include the pipe package as an honorable mention, but since it works differently from
the pipes in R, I do not recommend it.