Introduction to R

4.1 Tidy data

When talking about data sets in statistics and data science, the terms wide and long are often used. They describe the shape of a data set: a wide data set has more columns than the corresponding long data set, and conversely a long data set has more rows than the corresponding wide one.

More precisely, in a long data set each observation has its own row and each variable is a column. In a wide data set, variables (especially repeated measurements) can be spread over several columns, so that each row represents one particular person.

Consider the following example: we have repeated observations (of some psychological variable) for two people, at three different points in time. The following representation of the data set is in wide format:

Here each row stands for one person, and the observations are spread over the three time columns. However, these columns actually represent the three levels of a (repeated-measures) factor. We can represent exactly the same data in long format, in which each row corresponds to one observation and each column represents one variable. There are then three variables in total: a person identifier, the time factor, and a column that holds the values of the measured (psychological) variable.

We no longer have one row per person. The values of variables that do not change with each observation (= each row) are now repeated across rows (for example, the person identifier).

A data set organized this way is often called tidy.
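To make the two shapes concrete, here is a minimal sketch in base R. The column names (id, t1, t2, t3, time, score) and the values are hypothetical and chosen only for illustration; they are not the course's original data.

```r
# Hypothetical example data: two people (id), three time points (t1, t2, t3).
wide <- data.frame(
  id = c(1, 2),
  t1 = c(4, 6),
  t2 = c(5, 7),
  t3 = c(3, 8)
)

# The same six measurements in long format: one row per observation.
long <- data.frame(
  id    = rep(c(1, 2), each = 3),
  time  = rep(c("t1", "t2", "t3"), times = 2),
  score = c(4, 5, 3, 6, 7, 8)
)

dim(wide)  # rows and columns of the wide version
dim(long)  # rows and columns of the long version
```

The wide version has fewer rows and more columns; the long version repeats the identifier once per time point.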

As we shall see later, many types of statistical analysis, and especially plotting functions, require a long data set. A surprising amount of time is therefore often spent organizing data for further analysis (this type of work is often referred to as “data wrangling”).

In this course we will work with the tidyverse packages for manipulating and transforming data. Of course, there are other ways to manipulate data, but we believe the tidyverse approach is the most consistent way to work with data, while also reducing the cognitive burden on users.

The packages used for data manipulation are tidyr, for transforming / reshaping data sets, and dplyr, for manipulating / processing data sets and the variables they contain. We also use a function from the forcats package for factors. The functions we are going to look at are:

Package   Function                    Use
tidyr     pivot_longer()              increases the number of rows, decreases the number of columns
tidyr     pivot_wider()               decreases the number of rows, increases the number of columns
tidyr     drop_na()                   deletes all rows of a data set that contain missing values (NA)
dplyr     rename()                    for renaming variables
dplyr     select()                    selects variables (columns)
dplyr     filter()                    selects observations (rows)
dplyr     arrange()                   sorts a data set according to a certain variable
dplyr     mutate()                    creates new variables and changes existing variables
dplyr     recode()                    recodes numeric variables
forcats   fct_recode()                for recoding / renaming factor levels
dplyr     group_by()                  enables operations on subsets of the data
dplyr     summarise() / summarize()   summarizes data

Several of these functions have so-called scoped versions, with which the same transformation can be applied to several variables at the same time. There are three variants of each of these functions:

  • the _all variant applies the function to all variables in the data set (e.g. mutate_all())

  • the _at variant applies the function to variables that are named in a character vector or have been selected with the helper function vars() (e.g. mutate_at())

  • the _if variant applies the function to variables that have been selected with a condition (a function) (e.g. mutate_if())
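The three variants can be sketched as follows, assuming dplyr is installed and taking mutate() as the example verb. The data frame and its column names are hypothetical. Note that newer dplyr versions offer across() for the same purpose; the scoped verbs shown here still exist.

```r
library(dplyr)

# Hypothetical data: one character column, three numeric columns.
df <- data.frame(
  name    = c("a", "b"),
  score_1 = c(1.234, 2.345),
  score_2 = c(3.456, 4.567),
  age     = c(20.2, 31.7)
)

# _all: apply a function to every column.
all_chr <- df %>% mutate_all(as.character)

# _at: apply a function to columns chosen by name or with vars().
scores_rounded <- df %>% mutate_at(vars(starts_with("score")), round, digits = 1)

# _if: apply a function to columns that satisfy a condition.
numeric_rounded <- df %>% mutate_if(is.numeric, round)
```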

With these functions, very complex data manipulations can be carried out while the R code remains relatively clear. Besides the functions listed above, dplyr contains many more. These include, for example, functions for combining different data sets with one another (“merge”).

4.2 The pipe operator

We have already noticed that code can quickly become cluttered when we perform a sequence of operations. This leads to nested function calls.

Example: we have a numeric vector of \(n = 10\) measured values (generated here, for practice purposes, as normally distributed random numbers) and want to first center them, then calculate the standard deviation, and finally round the result to two decimal places.

We can calculate the rounded standard deviation of the centered values as nested function calls:
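A minimal sketch of such a nested call, assuming that scale(), sd() and round() are the three functions meant. The vector name x and the seed are hypothetical and only make the example reproducible.

```r
set.seed(1)       # hypothetical seed, only for reproducibility
x <- rnorm(10)    # n = 10 normally distributed measured values

# Center, take the standard deviation, round to two decimals, all nested.
round(sd(scale(x, center = TRUE, scale = FALSE)), digits = 2)
```

Centering does not change the spread, so the result equals the rounded standard deviation of x itself.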

The functions scale(), sd() and round() are now executed in sequence (from the inside out), so that the output of one function is passed as input to the next.

The functions scale() and round() have additional arguments: center = TRUE and scale = FALSE, and digits = 2, respectively. While this is efficient, it results in code that is difficult to read.

An alternative to this would be to save the intermediate steps as variables:

Each of the sub-steps is on its own line and we understand the code without any problems. However, this method requires that we define two variables that we don't really need.
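The step-by-step variant might look like this. It is a sketch under the same assumptions as before; the intermediate names x_centered and x_sd are hypothetical.

```r
set.seed(1)       # hypothetical seed, only for reproducibility
x <- rnorm(10)    # n = 10 normally distributed measured values

x_centered <- scale(x, center = TRUE, scale = FALSE)  # step 1: center
x_sd       <- sd(x_centered)                          # step 2: standard deviation
round(x_sd, digits = 2)                               # step 3: round
```

The two intermediate objects stay in the workspace even though only the final value is of interest.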

But there is a very elegant way to call functions one after the other without having to nest them: we use the pipe operator for this. It is made available by the dplyr package and looks like this: %>%

%>% is defined as an infix operator. That means it stands between two objects, similar to a mathematical operator. The name “pipe” is to be understood in the sense that we “forward” or “hand over” an object to a function.

Strictly speaking, the operator lives in the magrittr package and is imported by dplyr. If you want to know more about it, you can have a look at the magrittr documentation.

This operator is used so often that it has its own keyboard shortcut in RStudio: Cmd+Shift+M (macOS) or Ctrl+Shift+M (Windows and Linux).

Our example from above:

becomes, with the %>% operator:
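A sketch of both versions side by side, assuming magrittr is installed and that scale(), sd() and round() are the functions in question. The vector x and the seed are hypothetical.

```r
library(magrittr)

set.seed(1)       # hypothetical seed, only for reproducibility
x <- rnorm(10)

# Nested version: read from the inside out.
round(sd(scale(x, center = TRUE, scale = FALSE)), digits = 2)

# Piped version: read from top to bottom.
x %>%
  scale(center = TRUE, scale = FALSE) %>%
  sd() %>%
  round(digits = 2)
```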

This code reads like this:

  1. We start with the vector of measured values and pass it to scale() as an argument
  2. We apply scale() to it, with the additional arguments center = TRUE and scale = FALSE, and pass the output as an argument to sd()
  3. We apply sd() (without further arguments) and pass the output on as an argument to round()
  4. round(), with the further argument digits = 2, is executed. Since no further %>% follows, the output is written to the console.

So it is clear: if we want to use the result further, we have to assign it to a variable:

So we pass an object to a function with %>%. Unless we specify anything further, this object becomes the first argument of the function. The big advantages are:

  • Our code is more readable.
  • We didn't have to define any unnecessary variables.

Pipe operator syntax

The operator is generally used as follows. Let f, g and h be functions and x an object; then x %>% f is equivalent to f(x).

If y is a further argument of f, then x %>% f(y) is equivalent to f(x, y).

If we apply f, g and h in sequence, then x %>% f %>% g %>% h is equivalent to h(g(f(x))).

We don't have to pass the object as the first argument; we can use it anywhere. For this we need the argument placeholder, a dot: x %>% f(y, .) is equivalent to f(y, x).

In most cases, however, the object that is passed on is also the first argument of the next function (especially for the tidyverse functions), so we rarely need this placeholder.
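The pipe rules of this section can be checked with concrete functions. This is only an illustrative sketch, assuming magrittr is installed; the vector x and the chosen functions are arbitrary stand-ins.

```r
library(magrittr)

x <- c(1.234, 5.678)

# The object on the left becomes the only argument.
identical(x %>% sum, sum(x))

# The object on the left becomes the first argument; others follow.
identical(x %>% round(1), round(x, 1))

# A chain of three functions equals the nested call, read inside out.
identical(x %>% sum %>% sqrt %>% round, round(sqrt(sum(x))))

# The dot placeholder puts the object somewhere other than first.
identical(x %>% paste("value:", .), paste("value:", x))
```

Each comparison returns TRUE, since both sides are the same call written in two ways.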

When we work with the functions of the tidyr and dplyr packages, we will use this operator very often. Another reason to get used to it is that it is becoming more and more common, and many examples on the Internet (e.g. on Stack Overflow) use it.

4.3 Reshaping: tidyr

To transform a data set (reshaping) we need two functions, pivot_longer() and pivot_wider(); both are in the tidyr package.

4.3.1 pivot_longer()

We use pivot_longer() when we want to convert a wide data set into a long one. pivot_longer() is therefore used to combine several columns that represent the levels of a factor into one column that represents the factor itself. The values in the original columns are collected in a single value variable.

The syntax of pivot_longer() looks like this:

or with %>%:

The arguments have the following meanings:
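As a sketch of the call pattern and the role of each argument, assuming the tidyr function pivot_longer() is the one meant here. The data frame and the names id, t1, t2, t3, time and score are hypothetical.

```r
library(tidyr)
library(magrittr)  # provides %>%

wide <- data.frame(id = c(1, 2), t1 = c(4, 6), t2 = c(5, 7), t3 = c(3, 8))

# Plain function call: the data first, then the arguments.
long <- pivot_longer(
  wide,
  cols      = -id,      # columns to combine (here: all except id)
  names_to  = "time",   # new column that holds the old column names
  values_to = "score"   # new column that holds the cell values
)

# The same call with the pipe: the data frame becomes the first argument.
long_piped <- wide %>%
  pivot_longer(cols = -id, names_to = "time", values_to = "score")

identical(long, long_piped)
```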

Let's look again at the example used above. We have a wide data set with two people who were measured at three points in time.

We assume that the three time columns are not really separate variables, but can be viewed as levels of a time factor (e.g. three measurement occasions in a repeated-measures design). The values in these columns relate to the same type of measurement (e.g. a score) and should therefore be represented by a single variable in a long data set. We want to give a name both to the new factor and to the measured variable. The person identifier should be ignored, i.e. it should not be used during the restructuring (the function should not treat it as a further measurement point).

The new time variable should be defined as a factor in the next step:

The call to the function could also have been written more explicitly:

In the cols argument we put a minus sign in front of the identifier, so all columns except that one are used for the wide-to-long transformation. Instead, we could have selected the three time columns directly:
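The two ways of selecting columns, followed by the conversion to a factor, might look like this. It is a sketch assuming tidyr and dplyr are installed; all data and column names are hypothetical.

```r
library(tidyr)
library(dplyr)

wide <- data.frame(id = c(1, 2), t1 = c(4, 6), t2 = c(5, 7), t3 = c(3, 8))

# Exclude the identifier: every other column is pivoted.
long_a <- wide %>%
  pivot_longer(cols = -id, names_to = "time", values_to = "score") %>%
  mutate(time = factor(time))   # define the new variable as a factor

# Select the three time columns directly instead.
long_b <- wide %>%
  pivot_longer(cols = c(t1, t2, t3), names_to = "time", values_to = "score") %>%
  mutate(time = factor(time))

identical(long_a, long_b)
```

Excluding columns is convenient when there are many measurement columns and only a few identifiers.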

We'll see more examples of column selection when we look at the dplyr package.

Example

Let's look at another example, this time using the “Therapy” data set:

The structure of this data set is similar to that of the example above. We want to combine the measurement columns because they are levels of a common factor; the data in the cells represent repeated measurements. The remaining columns should be ignored for pivoting, but remain in the data set.

Let's do the wide-to-long transformation:

4.3.2 pivot_wider()

pivot_wider() is the opposite of pivot_longer(). This function takes a factor and a measurement variable and “spreads” the values of the measurement variable over new columns, which represent the levels of the factor. This means that we use pivot_wider() when we want to turn a long data set into a wide one.

The syntax looks like this:

Illustrated by our simple example, this means that we want to create a new variable for each level of the time factor, using the values of the measured variable.

The result looks exactly like the original data set. We can also check whether the two are really equivalent:
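A minimal round-trip sketch, assuming the tidyr functions pivot_longer() and pivot_wider(). The data and the names id, time and score are hypothetical.

```r
library(tidyr)

wide <- data.frame(id = c(1, 2), t1 = c(4, 6), t2 = c(5, 7), t3 = c(3, 8))
long <- pivot_longer(wide, cols = -id, names_to = "time", values_to = "score")

# One new column per level of time, filled with the values of score.
wide_again <- pivot_wider(long, names_from = time, values_from = score)

# Compare as plain data frames, so that a possible tibble class does not matter.
all.equal(as.data.frame(wide_again), wide)
```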

Example

Now let's convert the “Therapy” data set from long back to wide:

4.3.3 Exclude missing values: drop_na()

A very important function in the tidyr package is drop_na(). With it we can delete all rows that contain missing values.

This example serves as an illustration:
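A small sketch, assuming the tidyr function drop_na(). The data frame and its column names are hypothetical.

```r
library(tidyr)

df <- data.frame(
  id    = c(1, 2, 3),
  score = c(10, NA, 30),
  notes = c("ok", "ok", NA)
)

# Drops every row with an NA in any column: only id 1 remains.
drop_na(df)

# Only looks at score: ids 1 and 3 remain.
drop_na(df, score)
```

Naming the relevant columns keeps rows whose only missing values are in columns that do not matter for the analysis.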

This is very useful, but of course the function should be used with care; otherwise you may delete rows that have NAs only in variables that are not relevant to the question at hand (\(\rightarrow\) create sub-data sets first).

4.4 Manipulate data: dplyr

We are now able to convert data sets from wide to long and vice versa, but that is not the end of our work. Most data sets have to be edited before they can be analyzed: we have to select cases and / or variables, recode values, rename variables, sort according to certain variables, create new variables, split data sets according to grouping variables or combine variables.

The dplyr package provides functions for all of these tasks (and many more; we are only looking at a small selection here). dplyr consists, so to speak, of verbs (functions) for all these operations, and these functions can be combined in a very elegant way as required.

In this lecture we will only get to know a small part of the functionality of dplyr. If you want to know more, you can look it up in the help pages.

For the rest of this chapter we will work with long data sets. If a data set is wide, it should be converted first.

We load the package first:

We will now take a look at the various functions and their uses in turn. We always use the %>% operator, so the input data frame is always to be understood as the first argument of the function. For the examples we use several data sets; you can find the last of these in the data folder on ILIAS.

The following applies to all of the examples below: if we want to continue using the result, we have to assign the output to a new variable.

4.4.1 Renaming variables with rename()

Variables can be renamed with rename(). The variables that have not been renamed remain in the data set.

Syntax:

Example
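A short sketch of renaming, assuming the dplyr function rename(). The data frame and the names t1 and baseline are hypothetical.

```r
library(dplyr)

df <- data.frame(id = c(1, 2), t1 = c(4, 6))

# Pattern: rename(data, new_name = old_name)
renamed <- df %>% rename(baseline = t1)

names(renamed)
```

The new name goes on the left of the equals sign; columns that are not mentioned keep their names and positions.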

4.4.2 Select variables with select()

With the select() function we select variables from a data set.

Syntax:

This selects the listed variables from the data frame; a variable prefixed with a minus sign is omitted.

Examples
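Two small sketches of variable selection, assuming the dplyr function select(). The data frame and its column names are hypothetical.

```r
library(dplyr)

df <- data.frame(id = c(1, 2), time = c("t1", "t1"), score = c(4, 6))

# Keep only the named variables.
kept <- df %>% select(id, score)

# Keep everything except the variable marked with a minus sign.
dropped <- df %>% select(-time)

identical(kept, dropped)
```

Both calls return the same two columns here, because dropping time leaves exactly id and score.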

Order of variables

We can also change the order of the variables with select():

“Helper” functions for select()
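Reordering might be sketched like this, assuming dplyr's select() together with its everything() helper. The data frame and its column names are hypothetical.

```r
library(dplyr)

df <- data.frame(id = c(1, 2), time = c("t1", "t1"), score = c(4, 6))

# List the variables in the desired order.
ordered_a <- df %>% select(score, id, time)

# Or move one variable to the front and keep the rest as they are.
ordered_b <- df %>% select(score, everything())

names(ordered_b)
```

The second form is handy for wide data frames, where listing every column by hand would be tedious.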