Exploring R libraries for end-to-end data science projects

duy ngọc
12 min read · Apr 20, 2021

[You can see my other post, Explore Python libraries for end-to-end data science projects, in this link]

Important note: this post is just a collection of the most popular R libraries for data science that I know; other important libraries are missing from this list, so please leave me your feedback for improvement. Most of the information about these libraries is quoted directly from their websites (with link/source) to keep it objective. My personal opinions are marked with a specific note. Happy reading!

R is a powerful language for statistical analysis, visualization, time series analysis, data pre-processing and analytic web app development. This means R can cover a full stack (end-to-end) data science project with the important steps below:

Data preparation → Data visualization → Feature engineering → Build & validate ML model → Explain model → Communicate results → Deployment (web app)

Today we will explore some useful R libraries that can be used for full-stack data science purposes:

  1. Data preparation:

Tidyverse home page: https://www.tidyverse.org/
Data.table home page: https://rdatatable.gitlab.io/data.table/
Sparklyr home page: https://spark.rstudio.com/

Tidyverse and data.table are the best tools for acquiring and preparing data in R. Tidyverse has user-friendly syntax (select, mutate, filter, summarise, group_by…) compared to the less friendly syntax of data.table (R) or pandas (Python). However, data.table is very fast compared to tidyverse or pandas. I often choose data.table for large csv files (best memory management and speed). For very large data sets that need distributed processing across multiple computers, Spark is the best choice.

Update from https://dtplyr.tidyverse.org/: the tidyverse also has the dtplyr package, which provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code (select, filter…) that is automatically translated to the equivalent, but usually much faster, data.table code.
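A minimal sketch of how this works (based on the usage pattern in the dtplyr documentation; mtcars is just a toy example):
library(dtplyr)
library(dplyr)

mtcars_dt <- lazy_dt(mtcars)       # wrap a data frame as a lazy data.table

mtcars_dt %>%
  filter(cyl == 6) %>%
  group_by(gear) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  show_query()                     # print the translated data.table code

mtcars_dt %>%
  filter(cyl == 6) %>%
  as_tibble()                      # force computation and return a tibble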

Tidyverse ecosystem picture (author: Silvia Canelón, PhD. Original link)

  • CSV, txt, tsv … files: readr (tidyverse), fread (data.table)
Example from https://readr.tidyverse.org/reference/read_delim.html:
library(tidyverse)
read_csv("mtcars.csv")
read_tsv("a\tb\n1.0\t2.0")
read_delim("a|b\n1.0|2.0", delim = "|")
starwars %>% filter(species == "Droid")
starwars %>%
  mutate(name, bmi = mass / ((height / 100) ^ 2)) %>%
  select(name:mass, bmi)
Example from this link:
library(data.table)
flights <- fread ("flights14.csv")
flights[origin == "JFK" & month == 6L] # Get all the flights with “JFK” as the origin airport in the month of June.
flights[, c("arr_delay", "dep_delay")] # select column
  • Excel files: readxl (tidyverse)
library(readxl)
#import chickwts sheet
read_excel("madrid_temp.xlsx", sheet = "chickwts")
  • Database (MySQL…): dplyr (with the dbplyr backend), DBI. You can query data directly from a database with dplyr (no SQL required); the dbplyr backend automatically converts dplyr syntax to SQL code, as in the example below:
Example from this link:
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
flights <- copy_to(con, nycflights13::flights)
flights %>%
  select(contains("delay")) %>%
  show_query()
#> <SQL>
#> SELECT `dep_delay`, `arr_delay`
#> FROM `nycflights13::flights`
# If the database doesn't live in a file, but instead lives on another server:
con <- DBI::dbConnect(RMySQL::MySQL(),
  host = "database.rstudio.com",
  user = "hadley",
  password = rstudioapi::askForPassword("Database password")
)
  • Web: rvest, RCurl

rvest is a web scraping tool in R that is roughly the equivalent of BeautifulSoup in Python. The advantage of this tool is that you can code in the "tidy way" with %>%, which makes the code easier to understand than the Python equivalent.

Example from https://rvest.tidyverse.org/:
library(rvest)

# Start by reading a HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")
films <- starwars %>% html_elements("section")
title <- films %>%
  html_element("h2") %>%
  html_text2()
library(RCurl)
web <- "https://raw.githubusercontent.com/alstat/Analysis-with-Programming/master/2013/R/How%20to%20Enter%20Your%20Data/Data.dat"
x <- getURL(web)
y <- read.table(text = x, header = TRUE)

Sparklyr: R interface for Apache Spark (https://spark.rstudio.com/)

[Definition link]Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Sparklyr [Definition link]: an R interface for Spark in the "tidyverse" way. With sparklyr you can connect to Spark from R (with a dplyr backend), filter and aggregate Spark datasets and then bring them into R for analysis and visualization, use Spark's distributed machine learning library from R, and create extensions that call the full Spark API and provide interfaces to Spark packages.

Picture from https://spark.rstudio.com/guides/data-lakes/:

Resource to learn sparklyr and big data in R:
- Web page: https://spark.rstudio.com/
- Mastering Spark with R book: https://therinspark.com/
- Exploring, Visualizing, and Modeling Big Data with R: https://okanbulut.github.io/bigdata/

Example from https://spark.rstudio.com/:
# Connect to spark:
library(sparklyr)
sc <- spark_connect(master = "local")
# We can now use all of the available dplyr verbs against the tables within the cluster.
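As a rough sketch of what you can do next (mtcars is used here purely for illustration, and a local Spark installation, e.g. via spark_install(), is assumed):
library(dplyr)
# Copy a local data frame into the cluster, query it lazily with dplyr verbs,
# then collect() the aggregated result back into R
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()
spark_disconnect(sc)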

2. Data Visualization:

You can visit this excellent website (https://www.r-graph-gallery.com/index.html) to discover hundreds of charts made with the R programming language. The libraries below are just for reference.

[Link] ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

[My point of view]: This is the best visualization library for static charts; the only drawback is that its charts are not interactive.

Picture from https://www.r-graph-gallery.com/ggplot2-package.html

Example from https://ggplot2.tidyverse.org/:
library(ggplot2)

ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

[Definition link] Highcharts is a modern SVG-based, multi-platform charting library. It makes it easy to add interactive charts to web and mobile projects.

[My point of view]: It can be used with medium to large data sets (very effective compared to plotly or dygraphs).

Picture from https://www.highcharts.com/blog/products/highcharts/

Example from https://www.highcharts.com/blog/tutorials/highcharts-for-r-users/:
library(highcharter)
highchart(type = "stock") %>%
  hc_add_series(asset_returns_xts$SPY, type = "line")

Plotly’s R graphing library makes interactive, publication-quality graphs. [My point of view]: It can be used with small to medium data sets.

Picture from https://plotly.com/r/

Example from https://plotly.com/r/candlestick-charts/ :
library(plotly)
library(quantmod)
getSymbols("AAPL",src='yahoo')
# basic example of OHLC charts
df <- data.frame(Date = index(AAPL), coredata(AAPL))
df <- tail(df, 30)
fig <- df %>% plot_ly(x = ~Date, type = "candlestick",
                      open = ~AAPL.Open, close = ~AAPL.Close,
                      high = ~AAPL.High, low = ~AAPL.Low)
fig <- fig %>% layout(title = "Basic Candlestick Chart")
fig

3. Feature engineering

The idea of the recipes package is to define a recipe or blueprint that can be used to sequentially define the encodings and preprocessing of the data (i.e. “feature engineering”).

This package is extremely useful because it saves a ton of time when cleaning, transforming and adding new features to a raw dataset (compared to sklearn.preprocessing & sklearn.pipeline in Python). I hope the scikit-learn team will develop something like recipes in the near future to help speed up feature engineering tasks in the Python environment.

Some examples from https://recipes.tidymodels.org/reference/index.html :

… (see more in the link above)
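For instance, a minimal recipes sketch on the built-in mtcars data (a toy example of my own, not taken from the recipes documentation) might look like this:
library(recipes)
library(dplyr)

# Define a blueprint of preprocessing steps on the raw data
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%            # center & scale the predictors
  step_corr(all_numeric_predictors(), threshold = 0.9)    # drop highly correlated columns

prepped <- prep(rec)                        # estimate means, sds and correlations
baked   <- bake(prepped, new_data = NULL)   # apply the blueprint to the training set
head(baked)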

  • Timetk (for time series feature engineering)

https://business-science.github.io/timetk/

Timetk is the very best tool from Business Science for time series feature engineering. It is a tidyverse toolkit to visualize, wrangle, and transform time series data (from business-science.github.io).

timetk has step_timeseries_signature(), which adds a number of calendar features that can help machine learning models (new features such as day, week, month, quarter, and interactions between them…). You can see examples in their excellent post at this link.
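A small sketch of step_timeseries_signature() inside a recipe (the daily_tbl data frame here is hypothetical, created only for illustration):
library(timetk)
library(recipes)
library(dplyr)

daily_tbl <- data.frame(
  date  = seq.Date(as.Date("2020-01-01"), by = "day", length.out = 90),
  value = rnorm(90)
)

# step_timeseries_signature() expands the date column into calendar features
rec <- recipe(value ~ ., data = daily_tbl) %>%
  step_timeseries_signature(date)

features <- bake(prep(rec), new_data = NULL)
names(features)   # inspect the new calendar feature columns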

You can see other wonderful open source packages from Business Science at this link (covering forecasting, time series, feature engineering, EDA…). Check www.business-science.io for more information (I have learned a lot from this website, and it has saved me time in mastering R skills applied to business).

4. Machine learning model building & validation

H2O is the machine learning package that I use the most in R. H2O ML is very fast compared to traditional R machine learning packages (caret). It has a multi-language interface (R/Python), handles big data well with distributed processing, reduces the time spent selecting and tuning models with AutoML, and is ready for production deployment… Moreover, H2O also has a full ecosystem for almost all data science tasks, as described below:

Example from this link and link:
# 1. Getting start H2O:
library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
# 2. Importing Data:
# h2o.importFile(), h2o.importFolder(), h2o.importURL()
# 3. Data Manipulation and Description:
iris.h2o <- as.h2o(iris, destination_frame = "iris.h2o") # convert a data frame to an H2O frame
summary(iris.h2o)
colnames(iris.h2o)
....
# 4. Running Machine learning models:
# H2O offers some famous ML models: GBM, GLM, XGBoost, LightGBM, Random Forest, K-Means, PCA ...
h2o.gbm(y = dependent, x = independent, training_frame = australia.hex,
        ntrees = 10, max_depth = 3, min_rows = 2,
        learn_rate = 0.2, distribution = "gaussian")
# 5. Model Explainability (picture below):
# Run AutoML for 1 minute
aml <- h2o.automl(y = y, training_frame = train, max_runtime_secs = 60, seed = 1)
# Explain a single H2O model (e.g. leader model from AutoML)
exm <- h2o.explain(aml@leader, test)
# Explain first row with a single H2O model (e.g. leader model from AutoML)
h2o.explain_row(aml@leader, test, row_index = 1)
# Methods for an AutoML object
h2o.varimp_heatmap()
h2o.model_correlation_heatmap()
h2o.pd_multi_plot()
# Methods for an H2O model
h2o.residual_analysis_plot()
h2o.varimp_plot()
h2o.shap_explain_row_plot()
h2o.shap_summary_plot()
h2o.pd_plot()
h2o.ice_plot()

The main drawback of H2O ML is that a saved model cannot be run with a different version of the H2O package. This means that if I update the H2O package, I must retrain my model to make it compatible with the new environment.
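A rough sketch of the binary save/load workflow with h2o.saveModel() and h2o.loadModel() (here `model` stands for any trained H2O model, e.g. aml@leader from the AutoML example above); the reloaded model must be opened with the same H2O version, which is exactly the limitation described above:
library(h2o)
h2o.init()
# Save a trained model to disk and keep the returned path
path <- h2o.saveModel(object = model, path = "h2o_models/", force = TRUE)
# Later, in a session running the *same* H2O version, load it back
reloaded <- h2o.loadModel(path)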
All H2O ML methods (link):

Definition from https://topepo.github.io/caret/:

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for: data splitting, pre-processing, feature selection, model tuning using resampling, variable importance estimation… The main drawback of caret is that it is slow compared to H2O and scikit-learn (some caret models were written in C to improve speed). R is not the best language for speed optimization.

Example from this link:
library(caret)
library(mlbench)
data(Sonar)
# Train / test split:
set.seed(107)
inTrain <- createDataPartition(
  y = Sonar$Class,  ## the outcome data are needed
  p = .75,          ## the percentage of data in the training set
  list = FALSE
)
training <- Sonar[ inTrain,]
testing <- Sonar[-inTrain,]
# Apply a caret model with hyperparameter tuning and repeated CV
ctrl <- trainControl(
  method = "repeatedcv",
  repeats = 3,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

set.seed(123)
plsFit <- train(
  Class ~ .,
  data = training,
  method = "pls",
  preProc = c("center", "scale"),
  tuneLength = 15,
  trControl = ctrl,
  metric = "ROC"
)
# Visual hyperparameter tuning performance with ggplot
ggplot(plsFit)
# Predict and calculate model performance:
plsClasses <- predict(plsFit, newdata = testing)
str(plsClasses)
#> Factor w/ 2 levels "M","R": 2 1 1 1 2 2 1 2 2 2 ...
plsProbs <- predict(plsFit, newdata = testing, type = "prob")
confusionMatrix(data = plsClasses, testing$Class)

Definition from https://www.tidymodels.org/: The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. Tidymodels helps organize the whole machine learning process (data splitting, feature engineering, model creation, workflows, tuning…) in the tidy way. I think it is easier to use than the scikit-learn pipeline (Python), and I prefer tidymodels over caret for setting up machine learning models (personal opinion).

You can see some important packages in tidymodels in the picture below:
(Picture from https://jhudatascience.org/tidyversecourse/model.html)

Example from https://www.tidymodels.org/start/case-study/ :
library(tidymodels)

hotels <-
  read_csv('https://tidymodels.org/start/case-study/hotels.csv') %>%
  mutate_if(is.character, as.factor)

# DATA SPLITTING & RESAMPLING (package rsample)
set.seed(123)
splits <- initial_split(hotels, strata = children)

hotel_other <- training(splits)
hotel_test <- testing(splits)
# BUILD THE MODEL (package parsnip)
lr_mod <-
  logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
# CREATE THE RECIPE (package recipes)
holidays <- c("AllSouls", "AshWednesday", "ChristmasEve", "Easter",
"ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")

lr_recipe <-
  recipe(children ~ ., data = hotel_other) %>%
  step_date(arrival_date) %>%
  step_holiday(arrival_date, holidays = holidays) %>%
  step_rm(arrival_date) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
# CREATE THE WORKFLOW (package workflows)
lr_workflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(lr_recipe)
# CREATE THE GRID FOR TUNING
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
# TRAIN AND TUNE THE MODEL (package tune)
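# NOTE: val_set below is a validation split created earlier in the linked tutorial
# (via validation_split()); that step is not shown in this excerpt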
lr_res <-
  lr_workflow %>%
  tune_grid(val_set,
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))

lr_best <-
  lr_res %>%
  collect_metrics() %>%
  arrange(penalty) %>%
  slice(12)

5. Explain model

Explaining a machine learning model is a very important step because it helps us understand why our model predicts a given result. With the DALEX package, machine learning is not a black box anymore: we can find which features drive our predictions and give recommendations or explanations to our stakeholders/customers.

According to the DALEX package, there are two explanation levels: instance level (one sample) and dataset level (all samples). Instance level includes break-down plots, Shapley additive explanations, LIME, ceteris-paribus profiles, ceteris-paribus oscillations, and local-diagnostics plots. Dataset level includes model performance, variable importance, partial-dependence profiles, local-dependence and accumulated-local profiles, and residual-diagnostics plots.
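A minimal sketch with a simple model (DALEX works the same way for any model that has a predict method; H2O models can be wrapped via the companion DALEXtra package). The titanic_imputed data ships with DALEX:
library(DALEX)

# Fit a simple logistic regression on the bundled titanic_imputed data
model <- glm(survived ~ gender + age + fare, data = titanic_imputed, family = "binomial")

explainer <- explain(
  model,
  data  = titanic_imputed[, c("gender", "age", "fare")],
  y     = titanic_imputed$survived,
  label = "logistic regression"
)

# Dataset level: permutation-based variable importance
vi <- model_parts(explainer)
plot(vi)

# Instance level: break-down explanation for a single passenger
bd <- predict_parts(explainer, new_observation = titanic_imputed[1, c("gender", "age", "fare")])
plot(bd)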

Overview Dalex package (link from https://ema.drwhy.ai/introduction.html)

Instance level explanation (link: https://ema.drwhy.ai/summaryInstanceLevel.html)

Dataset level explanation (link: https://ema.drwhy.ai/summaryModelLevel.html)

6. Communicate Results

Overview (link): R Markdown provides an authoring framework for data science. With R Markdown you can save and execute code and generate high-quality reports that can be shared with an audience.

Supported formats (link): R Markdown supports dozens of static and dynamic output formats including HTML, PDF, MS Word, Beamer, HTML5 slides, Tufte-style handouts, books, dashboards, shiny applications, scientific articles, websites, and more.
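As a rough sketch, a minimal report.Rmd (the file name and content here are just for illustration) combines a YAML header with prose and executable R chunks; knit it with rmarkdown::render("report.Rmd") or the Knit button in RStudio:
---
title: "Example report"
output: html_document
---

```{r plot, message=FALSE}
library(ggplot2)
ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
```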

Notebook output (link)

HTML output (link) :

Slide presentation (link) :

Dashboards (link) :

Websites (link) :

Interactive documents (link):

7. Deployment (web app)

Today, businesses need applications that support decisions and reporting automatically, not just reports in a notebook or PowerPoint. Fortunately, R has some powerful packages for this purpose, as discussed below:

Overview (link) : Shiny is an R package that makes it easy to build interactive web apps straight from R. You can host standalone apps on a webpage or embed them in R Markdown documents or build dashboards. You can also extend your Shiny apps with CSS themes, htmlwidgets, and JavaScript actions. Shiny combines the computational power of R with the interactivity of the modern web.

My opinion: with Shiny, we can program both the front end and the back end with only the R language; no HTML, CSS, or JavaScript is required. Shiny is very useful for a data scientist to build a standalone data analytics web app for an idea demo or for production (deployment with AWS, Shiny Server, Heroku…).
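A minimal sketch of a complete Shiny app, with UI and server in one R script (a toy histogram demo, just for illustration):
library(shiny)

ui <- fluidPage(
  titlePanel("Histogram demo"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output, session) {
  output$hist <- renderPlot({
    hist(mtcars$mpg, breaks = input$bins, col = "steelblue", main = NULL)
  })
}

shinyApp(ui, server)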

A very good book for learning Shiny: Mastering Shiny by Hadley Wickham (link: https://mastering-shiny.org/)

Shiny for real estate investment web app (link)

Shiny for radiant web app (link)

Shiny for trading web app (link)

many more….
