How to handle dummy variables in R?

This recipe helps you handle dummy variables in R
Last Updated: 14 Jun 2022


Recipe Objective

In data science, most machine learning algorithms expect every input variable to be numeric. If the data contains non-numeric variables, we need to convert them before building a model.

In this recipe, we will learn how to handle a string categorical variable by converting it into dummy variables.

A categorical variable takes distinct string values, or categories, to which observations are assigned. These values have no numeric meaning, so a model cannot use them directly. Hence, we convert them into dummy variables, which is similar to the one-hot encoding technique in Python. For a categorical variable with n unique categories, we keep (n-1) columns of 0s and 1s, where "1" indicates that the observation belongs to that category. The remaining category is implied when all of the dummy columns are 0.
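To make the (n-1) idea concrete, here is a minimal base R sketch. The data frame and its `colour` column are hypothetical and are not part of this recipe's dataset. It uses `model.matrix()`, which is one of several ways to build dummy columns in R and is not the method used in the steps that follow.

```r
# Toy data: hypothetical values, not taken from the recipe's dataset
df <- data.frame(
  id = 1:4,
  colour = c("red", "green", "blue", "green"),
  stringsAsFactors = TRUE
)

# model.matrix() expands the factor into indicator columns.
# With an intercept in the formula, the first level ("blue", alphabetically)
# becomes the baseline and gets no column, leaving n - 1 = 2 dummy columns.
# The [, -1] drops the intercept column itself.
dummies <- model.matrix(~ colour, data = df)[, -1, drop = FALSE]

# Combine the identifier with the dummy columns
encoded <- cbind(df["id"], dummies)
print(encoded)
```

A row with 0 in both `colourgreen` and `colourred` is a "blue" observation, which is why the third column is not needed.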

- Step 1: Loading the required library and dataset
- Step 2: Creating dummy variable

Step 1: Loading the required library and dataset

We require the fastDummies and knitr packages, along with tidyverse for data manipulation.

# installing the required packages
install.packages(c("fastDummies", "knitr", "tidyverse"))

library(fastDummies)
library(knitr)
# data manipulation package (provides glimpse())
library(tidyverse)

# reading the dataset
customer_seg = read.csv('R_223_Mall_Customers.csv')
glimpse(customer_seg)

Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...
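Before encoding, it can help to confirm which columns are non-numeric. The sketch below is one way to do that in base R. To stay self-contained it rebuilds only the first five rows shown in the glimpse() output above instead of reading the CSV; the name `customer_head` is made up for this illustration.

```r
# First five rows, copied from the glimpse() output above
customer_head <- data.frame(
  CustomerID = 1:5,
  Gender = c("Male", "Male", "Female", "Female", "Female"),
  Age = c(19, 21, 20, 23, 31),
  Annual.Income..k.. = c(15, 15, 16, 16, 17),
  Spending.Score..1.100. = c(39, 81, 6, 77, 40)
)

# Flag columns that are not numeric; these need encoding before modelling
non_numeric <- names(customer_head)[!sapply(customer_head, is.numeric)]
print(non_numeric)

# List the distinct categories found in each flagged column
lapply(customer_head[non_numeric], unique)
```

Here only `Gender` is flagged, so it is the only column that needs dummy variables.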

Step 2: Creating dummy variable

We create dummy variables for the "Gender" variable using the dummy_cols() function of the fastDummies package.

Syntax: fastDummies::dummy_cols(x, select_columns = )

where:

x = dataframe
select_columns = the column (categorical variable) for which you want to create dummy variables.
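As a hedged aside, dummy_cols() also accepts the arguments `remove_first_dummy` and `remove_selected_columns`, which can produce the (n-1) layout in a single call. The sketch below assumes a fastDummies version recent enough to include `remove_selected_columns`, and uses a small made-up data frame (`toy`) rather than the recipe's CSV. Which category counts as "first" can depend on the column type and package version, so check the resulting column names rather than assuming them.

```r
library(fastDummies)

# Toy data mirroring two columns of the recipe's dataset
toy <- data.frame(
  CustomerID = 1:5,
  Gender = c("Male", "Male", "Female", "Female", "Female")
)

encoded <- dummy_cols(
  toy,
  select_columns = "Gender",
  remove_first_dummy = TRUE,       # keep n - 1 dummy columns
  remove_selected_columns = TRUE   # drop the original Gender column
)

# Inspect which dummy column was kept
print(names(encoded))
```

This is an alternative to dropping columns by hand; the recipe itself uses the manual approach.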

# creating dummy variables
df_dummies = fastDummies::dummy_cols(customer_seg, select_columns = "Gender")

# dropping the original Gender column (column 2) along with the Gender_Female column (column 6)
# to get (n-1) columns, similar to one-hot encoding
new_customer_seg = df_dummies[c(-2, -6)]
glimpse(new_customer_seg)

Rows: 200
Columns: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...
$ Gender_Male             1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,...

Note: In the dummy variable created (Gender_Male), 1 = Male and 0 = Female.
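Dropping columns by position works here, but it silently breaks if the column order changes. A possible alternative is to drop them by name. The sketch below is base R only and uses a hand-built stand-in for df_dummies with a subset of the columns; the five rows are taken from the outputs shown above, and `drop_cols` is a name introduced for this illustration.

```r
# Stand-in for df_dummies (subset of columns, first five rows from the outputs above)
df_dummies <- data.frame(
  CustomerID = 1:5,
  Gender = c("Male", "Male", "Female", "Female", "Female"),
  Age = c(19, 21, 20, 23, 31),
  Gender_Female = c(0L, 0L, 1L, 1L, 1L),
  Gender_Male = c(1L, 1L, 0L, 0L, 0L)
)

# Drop columns by name rather than by position
drop_cols <- c("Gender", "Gender_Female")
new_customer_seg <- df_dummies[, setdiff(names(df_dummies), drop_cols), drop = FALSE]

# Sanity check: Gender_Male is 1 exactly where the original label was "Male"
stopifnot(all((df_dummies$Gender == "Male") == (new_customer_seg$Gender_Male == 1)))

print(names(new_customer_seg))
```

The `stopifnot()` line doubles as a quick check that the 1 = Male, 0 = Female reading of the dummy column holds for these rows.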

