This is the second post in a series on dplyr functions. It covers tools to manipulate your columns to get them the way you want them: this can be the calculation of a new column, changing a column into discrete values, or splitting/merging columns.
Content
- Mutating several columns at once
- Working with discrete columns
- Turning data into NA
The data
As per previous blog posts, many of these functions truly shine when you have a lot of columns, but to make it easy for people to copy-paste code and experiment, I'm using a ggplot2 built-in dataset, msleep:
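A minimal sketch of loading the data, assuming the dplyr and ggplot2 packages are installed (ggplot2 is only needed here because it ships the msleep dataset):

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Compact overview of all columns and their types
glimpse(msleep)
```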
Mutating columns: the basics
You can make new columns with the mutate() function. The options inside mutate are almost endless: pretty much anything that you can do to normal vectors can be done inside a mutate() function.
Anything inside mutate can either be a new column (by giving mutate a new column name), or can replace the current column (by keeping the same column name).
One of the simplest options is a calculation based on values in other columns. In the sample code, we're changing the sleep data from data measured in hours to minutes.
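A minimal sketch of the hours-to-minutes conversion, assuming the msleep data; the new column name sleep_total_min is just an illustrative choice:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

sleep_minutes <- msleep %>%
  select(name, sleep_total) %>%
  # new column name, so the original sleep_total is kept
  mutate(sleep_total_min = sleep_total * 60)

sleep_minutes
```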
New columns can be made with aggregate functions such as average, median, max, min, sd, …
The sample code makes two new columns: one showing the difference of each row versus the average sleep time, and one showing the difference versus the animal with the least sleep.
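A sketch of those two columns, assuming the msleep data; the column names sleep_total_vs_avg and sleep_total_vs_min are made up for illustration:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

sleep_diff <- msleep %>%
  select(name, sleep_total) %>%
  mutate(
    # mean() and min() summarise the whole column; the result is recycled per row
    sleep_total_vs_avg = sleep_total - round(mean(sleep_total), 1),
    sleep_total_vs_min = sleep_total - min(sleep_total)
  )

sleep_diff
```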
In the comments below, Steve asked about aggregate functions across columns. These functions by nature will want to summarise a column (as shown above). If, however, you want to sum() or mean() across columns, you might run into errors or absurd answers. In these cases you can either revert to actually spelling out the arithmetic, mutate(average = (sleep_rem + sleep_cycle) / 2), or you have to add a special instruction to the pipe that it should perform these aggregate functions not on the entire column, but by row:
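One way to give that instruction is dplyr's rowwise(). A minimal sketch, assuming the msleep data (the column name avg is illustrative):

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

sleep_rowwise <- msleep %>%
  select(name, contains("sleep")) %>%
  rowwise() %>%                                     # treat every row as its own group
  mutate(avg = mean(c(sleep_rem, sleep_cycle))) %>% # mean of the two values in that row
  ungroup()                                         # drop the row-wise grouping again

sleep_rowwise
```

Rows where either value is NA will get NA as the average, just as with the spelled-out arithmetic.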
The ifelse() function deserves a special mention because it is particularly useful if you don't want to mutate the whole column in the same way. With ifelse(), you first specify a logical statement, afterwards what needs to happen if the statement returns TRUE, and lastly what needs to happen if it's FALSE.
Imagine that we have a database with two large values which we assume are typos or measurement errors, and we want to exclude them. The below code will take any brainwt value above 4 and return NA. In this case, the code won't change anything for values of 4 or below.
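A minimal sketch, assuming the msleep data; writing to a new column brainwt2 (an illustrative name) keeps the original for comparison:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

brain_clean <- msleep %>%
  select(name, brainwt) %>%
  # condition, value if TRUE, value if FALSE
  mutate(brainwt2 = ifelse(brainwt > 4, NA, brainwt)) %>%
  arrange(desc(brainwt))

brain_clean
```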
You can also mutate string columns with stringr's str_extract() function in combination with any character or regex patterns.
The sample code will return the last word of the animal name and make it lower case.
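A sketch of that extraction, assuming the msleep data and the stringr package; the regex "\\w+$" (one or more word characters at the end of the string) and the column name name_last_word are my own choices:

```r
library(dplyr)
library(stringr)
library(ggplot2)  # provides the msleep dataset

last_word <- msleep %>%
  select(name) %>%
  # grab the trailing word, then lower-case it
  mutate(name_last_word = tolower(str_extract(name, pattern = "\\w+$")))

last_word
```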
Mutating several columns at once
This is where the magic really happens. Just like with the select() functions in part 1, there are variants to mutate():
- mutate_all() will mutate all columns based on your further instructions.
- mutate_if() first requires a function that returns a boolean to select columns. If that is true, the mutate instructions will be followed on those variables.
- mutate_at() requires you to specify columns inside a vars() argument for which the mutation will be done.
Mutate all
The mutate_all() version is the easiest to understand, and pretty nifty when cleaning your data. You just pass an action (in the form of a function) that you want to apply across all columns.
Something easy to start with: turning all the data to lower case:
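A minimal sketch, assuming the msleep data. Note that tolower() converts its input to character, so the numeric columns come back as text too:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Pass the function name without brackets
msleep_lower <- msleep %>%
  mutate_all(tolower)

msleep_lower
```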
The mutating action needs to be a function: in many cases you can pass the function name without the brackets, but in some cases you need arguments or you want to combine elements. In this case you have some options: either you make a function up front (useful if it's longer), or you make a function on the fly by wrapping it inside funs() or via a tilde. Note that funs() has been deprecated since dplyr 0.8.0, so the tilde form is the safer choice in current versions.
For instance, after scraping the web, you often have tables with too many spaces and extra \n signs, but you can clean it all in one go.
I'm first going to use mutate_all() to screw things up:
The below paste mutation requires a function on the fly. You can either use ~paste(., ' /n ') or funs(paste(., ' /n ')). When making a function on the fly, you usually need a way to refer to the value you are replacing, which is what the . symbolizes.
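A sketch of the messing-up step using the tilde form, assuming the msleep data (msleep_ohno is an illustrative name):

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Append spaces and a stray "/n" to every value in every column
msleep_ohno <- msleep %>%
  mutate_all(~ paste(., "  /n  "))

msleep_ohno
```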
Let's clean it up again:
In this code I am assuming that not all values show the same amount of extra white space, as is often the case with parsed data: it first removes any /n, and then trims any additional white space:
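A sketch of the clean-up, assuming the msleep data and stringr. The messy table is rebuilt first so the example runs on its own; the names msleep_ohno and msleep_corr are illustrative:

```r
library(dplyr)
library(stringr)
library(ggplot2)  # provides the msleep dataset

# Recreate the messy data: every value gets padding and a stray "/n"
msleep_ohno <- msleep %>%
  mutate_all(~ paste(., "  /n  "))

msleep_corr <- msleep_ohno %>%
  mutate_all(~ str_replace_all(., "/n", "")) %>%  # needs an argument, so use a tilde
  mutate_all(str_trim)                            # plain function name is enough

msleep_corr
```

All columns are still character at this point; converting the numeric ones back would be a separate step.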
Mutate if
Not all cleaning functions can be done with mutate_all(). Trying to round your data will lead to an error if you have both numerical and character columns.
Error in mutate_impl(.data, dots) : Evaluation error: non-numeric argument to mathematical function.
In these cases we have to add the condition that columns need to be numeric before giving round() instructions, which can be done with mutate_if().
By using mutate_if() we need two arguments inside a pipe:
- First, it needs information about the columns you want it to consider. This information needs to be a function that returns a boolean value. The easiest cases are functions like is.numeric, is.integer, is.double, is.logical, is.factor, lubridate::is.POSIXt or lubridate::is.Date.
- Secondly, it needs instructions about the mutation in the form of a function. If needed, use a tilde or funs() before (see above).
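The two arguments can be sketched as follows, assuming the msleep data (msleep_rounded is an illustrative name):

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

msleep_rounded <- msleep %>%
  select(name, sleep_total:bodywt) %>%
  # 1st argument: which columns (predicate), 2nd argument: what to do with them
  mutate_if(is.numeric, round)

msleep_rounded
```

The character column name is skipped, so round() never sees non-numeric input.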
Mutate at to change specific columns
By using mutate_at() we need two arguments inside a pipe:
- First, it needs information about the columns you want it to consider. In this case you can take any selection of columns (using all the options possible inside a select() function) and wrap it inside vars().
- Secondly, it needs instructions about the mutation in the form of a function. If needed, use a tilde or funs() before (see above).
All sleep-measuring columns are in hours. If I want those in minutes, I can use mutate_at() and wrap all 'sleep'-containing columns inside vars(). Secondly, I make a function on the fly to multiply every value by 60.
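A minimal sketch, assuming the msleep data (sleep_min is an illustrative name):

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

sleep_min <- msleep %>%
  select(name, sleep_total:awake) %>%
  # only columns whose name contains "sleep" are multiplied
  mutate_at(vars(contains("sleep")), ~ (. * 60))

sleep_min
```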
The sample code shows that in this case all sleep columns have been changed into minutes, but awake was not.
Changing column names after mutation
With a singular mutate() statement, you immediately have the option to change the column's name. In the above example, for instance, it is confusing that the sleep columns are in a different unit; you can change that by calling a rename function:
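One possible way to do this is rename_at(), sketched here on the msleep data; the "_min" suffix and the name sleep_renamed are illustrative:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

sleep_renamed <- msleep %>%
  select(name, sleep_total:awake) %>%
  mutate_at(vars(contains("sleep")), ~ (. * 60)) %>%
  # append a unit suffix to the columns that were converted
  rename_at(vars(contains("sleep")), ~ paste0(., "_min"))

names(sleep_renamed)
```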
Or, as Tomas McManus pointed out: you can assign a 'tag' inside funs() which will be appended to the current name. The main difference between both options: the funs() version is one line of code less, but columns will be added rather than replaced. Depending on your scenario, either could be useful.
Working with discrete columns
Recoding discrete columns
To rename or reorganize current discrete columns, you can use recode() inside a mutate() statement: this enables you to change the current naming, or to group current levels into fewer levels. The .default refers to anything that isn't covered by the preceding groups, with the exception of NA. You can change NA into something other than NA by adding a .missing argument if you want (see next sample code).
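A sketch of recode() on the conservation column of msleep; the new labels and the grouping of 'domesticated' under 'Least_Concern' are arbitrary choices for illustration:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

recoded <- msleep %>%
  mutate(conservation2 = recode(conservation,
                                "en" = "Endangered",
                                "lc" = "Least_Concern",
                                "domesticated" = "Least_Concern",
                                .default = "other"))  # everything else, except NA

count(recoded, conservation2)
```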
A special version exists to return a factor: recode_factor(). By default the .ordered argument is FALSE. To return an ordered factor, set the argument to TRUE:
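A sketch with recode_factor() on the msleep data, also using .missing to relabel NA; the labels are illustrative:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

recoded_fct <- msleep %>%
  mutate(conservation2 = recode_factor(conservation,
                                       "en" = "Endangered",
                                       "lc" = "Least_Concern",
                                       "domesticated" = "Least_Concern",
                                       .default = "other",
                                       .missing = "no data",  # relabel NA
                                       .ordered = TRUE))      # return an ordered factor

count(recoded_fct, conservation2)
```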
Creating new discrete column (two levels)
The ifelse() statement can be used to turn a numeric column into a discrete one. As mentioned above, ifelse() takes a logical expression, then what to do if the expression returns TRUE, and lastly what to do when it returns FALSE.
The sample code will divide the current measure sleep_total into a discrete 'long' or 'short' sleeper.
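A minimal sketch on the msleep data; the cut-off of 10 hours and the column name sleep_time are assumptions made for the example:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

sleepers <- msleep %>%
  select(name, sleep_total) %>%
  # more than 10 hours counts as "long" here (arbitrary threshold)
  mutate(sleep_time = ifelse(sleep_total > 10, "long", "short"))

sleepers
```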
Creating new discrete column (multiple levels)
ifelse() can be nested if you want more than two levels, but it might be even easier to use case_when(), which allows as many statements as you like and is easier to read than many nested ifelse statements.
The arguments are evaluated in order, so only the rows where the first statement is not true will continue to be evaluated for the next statement. For everything that is left at the end, just use TRUE ~ 'newname'.
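A sketch with four levels on the msleep data; the thresholds and labels are arbitrary choices for illustration:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

sleep_discr <- msleep %>%
  select(name, sleep_total) %>%
  mutate(sleep_total_discr = case_when(
    sleep_total > 13 ~ "very long",  # checked first
    sleep_total > 10 ~ "long",       # only rows that failed the first test
    sleep_total > 7  ~ "limited",
    TRUE             ~ "short"       # everything that is left
  ))

sleep_discr
```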
Unfortunately there seems to be no easy way to get case_when() to return an ordered factor, so you will need to do that yourself afterwards, either by using forcats::fct_relevel(), or just with a factor() function. If you have a lot of levels I would advise making a levels vector up front to avoid cluttering the pipe too much.
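A sketch of the factor() route with an up-front levels vector, on the msleep data; the thresholds, labels and the name sleep_levels are illustrative:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Levels defined once, outside the pipe
sleep_levels <- c("short", "limited", "long", "very long")

sleep_fct <- msleep %>%
  select(name, sleep_total) %>%
  mutate(sleep_total_discr = case_when(
    sleep_total > 13 ~ "very long",
    sleep_total > 10 ~ "long",
    sleep_total > 7  ~ "limited",
    TRUE             ~ "short"
  )) %>%
  # impose the intended order on the labels
  mutate(sleep_total_discr = factor(sleep_total_discr, levels = sleep_levels))

count(sleep_fct, sleep_total_discr)
```

Adding ordered = TRUE to the factor() call would make it an ordered factor rather than a plain factor with a fixed level order.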
The case_when() function does not only work inside a column, but can be used for grouping across columns:
Splitting and merging columns
Take for example this dataset
You can unmerge any columns by using tidyr's separate() function. To do this, you have to specify the column to be split, followed by the new column names, and which separator it has to look for.
The sample code shows separating into two columns based on '=' as a separator.
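A sketch of separate(). The original dataset is not reproduced here, so the small table conservation_expl below, its column name and its three rows are hypothetical stand-ins:

```r
library(dplyr)
library(tidyr)

# Hypothetical example data: abbreviation and description merged in one column
conservation_expl <- tibble(
  conservation_abbreviation = c("EX = Extinct",
                                "EW = Extinct in the wild",
                                "CR = Critically Endangered")
)

conservation_table <- conservation_expl %>%
  separate(conservation_abbreviation,
           into = c("abbreviation", "description"),  # new column names
           sep = " = ")                              # separator, including the spaces

conservation_table
```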
The opposite is tidyr's unite() function. You specify the new column name, and then the columns to be united, and lastly what separator you want to use.
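A sketch of unite() on a small hypothetical table (the data, the column name united_col and the ": " separator are made up for the example):

```r
library(dplyr)
library(tidyr)

# Hypothetical example data with two separate columns
conservation_table <- tibble(
  abbreviation = c("EX", "EW", "CR"),
  description  = c("Extinct", "Extinct in the wild", "Critically Endangered")
)

united <- conservation_table %>%
  unite(col = "united_col",           # new column name
        abbreviation, description,   # columns to paste together
        sep = ": ")                   # separator between the values

united
```

By default the input columns are dropped; remove = FALSE keeps them.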
Bringing in columns from other data tables
If you want to add information from another table, you can use the joining functions from dplyr. The msleep data contains abbreviations for conservation, but if you are not familiar with the topic you might need the description we used in the section above inside the msleep data.
Joins would be a chapter in themselves, but in this particular case you would do a left_join(), i.e. keeping my main table (on the left), and adding columns from another one to the right. In the by = statement you specify which columns are the same, so the join knows what to add where.
The sample code will add the description of the different conservation states into our main msleep table. The main data contained an extra domesticated label which I wanted to keep. This is done in the last line of the code with an ifelse().
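A sketch of the join on the msleep data. The lookup table conservation_table is a hypothetical stand-in built by hand, and the descriptions in it are my own illustrative labels:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Hypothetical lookup table: abbreviation -> description
conservation_table <- tibble(
  abbreviation = c("LC", "NT", "VU", "EN", "CD"),
  description  = c("Least Concern", "Near Threatened", "Vulnerable",
                   "Endangered", "Conservation Dependent")
)

joined <- msleep %>%
  select(name, conservation) %>%
  mutate(conservation = toupper(conservation)) %>%  # match the upper-case keys
  left_join(conservation_table, by = c("conservation" = "abbreviation")) %>%
  # labels without a match (e.g. DOMESTICATED) keep their original value
  mutate(description = ifelse(is.na(description), conservation, description))

joined
```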
Spreading and gathering data
The gather() function will gather up many columns into one. In this case, we have 3 columns that describe a time measure. For some analyses and graphs, it might be necessary to get them all into one.
The gather function needs you to give a name ('key') for the new descriptive column, and another name ('value') for the value column. The columns that you don't want to gather need to be deselected at the end. In the sample code I'm deselecting the column name.
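A sketch of the gathering step on the msleep data; the key and value names sleep_measure and time are chosen for the example. In current tidyr, gather() still works but pivot_longer() is the recommended successor:

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the msleep dataset

sleep_long <- msleep %>%
  select(name, contains("sleep")) %>%
  # -name at the end: gather everything except the name column
  gather(key = "sleep_measure", value = "time", -name)

sleep_long
```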
A useful attribute in gathering is the factor_key argument, which is FALSE by default. In the previous example the new column sleep_measure is a character vector. If you are going to summarise or plot afterwards, that column will be ordered alphabetically.
If you want to preserve the original order, add factor_key = TRUE, which will make the new column a factor whose levels follow the original column order.
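A sketch with factor_key = TRUE on the msleep data (key and value names are illustrative):

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the msleep dataset

sleep_long_fct <- msleep %>%
  select(name, contains("sleep")) %>%
  gather(key = "sleep_measure", value = "time", -name, factor_key = TRUE)

# Levels follow the column order in the data, not the alphabet
levels(sleep_long_fct$sleep_measure)
```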
The opposite of gathering is spreading. Spread will take one column and make multiple columns out of it. If you had started with the previous (gathered) data, you could get the different sleep measures in different columns:
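A sketch of spreading on the msleep data. The long table is rebuilt first so the example runs on its own; in current tidyr, pivot_wider() is the recommended successor to spread():

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the msleep dataset

# Long format: one row per animal and sleep measure
sleep_long <- msleep %>%
  select(name, contains("sleep")) %>%
  gather(key = "sleep_measure", value = "time", -name)

# Back to wide format: one column per sleep measure
sleep_wide <- sleep_long %>%
  spread(sleep_measure, time)

sleep_wide
```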
Turning data into NA
The function na_if() turns particular values into NA. In most cases the command will probably be na_if("") (i.e. turn an empty string into NA), but in principle you can do anything.
The sample code will turn any value that reads 'omni' into NA.
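A minimal sketch on the msleep data. It applies na_if() to the vore column inside mutate(), which is an assumption on my part about where 'omni' should be replaced, and it works the same way across dplyr versions:

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

no_omni <- msleep %>%
  select(name:order) %>%
  # every "omni" in vore becomes NA, other values are untouched
  mutate(vore = na_if(vore, "omni"))

no_omni
```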