Explore Python Libraries for End to End Data Science Project

duy ngọc · Published in Geek Culture · May 3, 2021 · 15 min read

[You can see my other post about exploring R libraries for an end-to-end data science project at this link]

Important note: this post is just a collection of the most popular Python libraries for data science that I have come across; some other important libraries are not on this list, so please leave me your feedback for improvement. Almost all information about these libraries is quoted directly from their websites (with link/source) to keep the objectivity. My personal opinions are marked with a specific note. Happy reading!

Python is a very powerful language for data science, especially for machine learning / deep learning. Python can cover a full-stack (end-to-end) data science project through the important steps below:

Data preparation → Data visualization → Feature engineering → Build & validate ML model → Explain model → Communicate results → Deployment (web app)

Today we will explore some useful Python libraries that can be used for full-stack data science purposes:

  1. Data preparation (get and clean data)

For data that fits in memory:

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool (link). It can work with many different file formats: CSV, text files, SQL databases, Excel, HDF5…

Example from this link:

import numpy as np
import pandas as pd
# Object creation:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
# Viewing data:
df.head(5); df.tail(3); df.index; df.columns; df.describe()
# Selection:
df["A"]; df[0:3]                       # select by column / by rows
df.loc["20130102", ["A", "B"]]         # select by label
df.iloc[3]; df.iloc[3:5, 0:2]          # select by position
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df[df["A"] > 0]; df2[df2["E"].isin(["two", "four"])]   # Boolean indexing
# Missing data:
df.dropna(how="any")                   # drop any rows that have missing data
df.fillna(value=5)                     # fill missing data

My opinions:
Pros: less coding, fast to learn, easy customization
Cons: the syntax is not as easy to understand as tidyverse (R); difficult to deal with big data

For larger-than-memory data or distributed environments (big data):

Datatable is a Python library for manipulating tabular data. It supports out-of-memory datasets, multi-threaded data processing, and a flexible API.

Example from this link:

import datatable as dt
# Loading data:
DT = dt.fread("~/Downloads/dataset_01.csv")   # read a CSV file
DT2 = dt.fread("data.jay")                    # read a binary .jay file
# Basic frame properties:
print(DT.shape)    # (nrows, ncols)
print(DT.names)    # column names
print(DT.stypes)   # column types
# Select subsets of rows/columns:
DT[:, "A"]         # select 1 column
DT[:10, :]         # first 10 rows
DT[::-1, "A":"D"]  # reverse row order, columns from A to D
DT[27, 3]          # single element in row 27, column 3 (0-based)

Definition [Link]: Dask is a flexible library for parallel computing in Python. Dask is composed of two parts:

  1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  2. “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.

(Picture from https://docs.dask.org/en/latest/)

Design (https://docs.dask.org/en/latest/dataframe.html) :

Dask DataFrames coordinate many Pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These Pandas objects may live on disk or on other machines.

Dask DataFrame copies the pandas API: because the dask.dataframe application programming interface (API) is a subset of the pandas API, it should be familiar to pandas users. There are some slight alterations due to the parallel nature of Dask:

from dask.distributed import Client
import dask.dataframe as dd

client = Client(n_workers=4)
df = dd.read_csv('2015-*-*.csv')
df2 = df[df.y == 'a'].x + 1
# As with all Dask collections, one triggers computation by calling the .compute() method:
df.groupby(df.user_id).value.mean().compute()
df2.compute()
df_filter = df[df[2] == 'SU3037_0112'].compute()
df_filter.describe()

Definition (link) : Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

(Picture from databricks.com/spark/about)

Run PySpark in Google Colab:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-bin-hadoop3.2.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
path='Link to your data file'
df = spark.read.option("delimiter", "\t").csv(path)
df.show(5)
print((df.count(), len(df.columns)))
product_type='_c2'
df.groupBy(product_type).count().show()
product='SU3037_0112'
df_product=df.filter(df._c2 == product)
df_product.describe().show()

2. Data Visualization

Non-interactive charts:

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. From my point of view, this is a very complex library (a lot of customization options) for data visualization compared to ggplot2 (R) or Seaborn.

(Picture from https://matplotlib.org/stable/tutorials/index.html)
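
To illustrate, a minimal Matplotlib sketch using the standard pyplot interface (the sine/cosine data below is made up for the example):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")   # one line per call to ax.plot
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x"); ax.set_ylabel("y")
ax.set_title("A basic Matplotlib line plot")
ax.legend()
plt.show()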

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. [My opinion]: with Seaborn you can create beautiful plots with less code than with matplotlib.

(Picture from https://seaborn.pydata.org/index.html)
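
A minimal Seaborn sketch using the tips example dataset that ships with the library:

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")          # built-in example dataset
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
plt.show()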

plotnine is an implementation of a grammar of graphics in Python; it is based on ggplot2. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot. Plotting with a grammar is powerful: it makes custom (and otherwise complex) plots easy to think about and then create, while simple plots remain simple.

*** ggplot2: the very best data visualization tool in R [author's opinion]

(Picture from https://plotnine.readthedocs.io/en/stable/gallery.html)
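
A minimal plotnine sketch in the ggplot2 style, using the mtcars dataset bundled with the library:

from plotnine import ggplot, aes, geom_point, labs
from plotnine.data import mtcars

p = (ggplot(mtcars, aes(x="wt", y="mpg", color="factor(cyl)"))
     + geom_point()
     + labs(x="Weight", y="Miles per gallon"))
print(p)   # printing the ggplot object renders the figure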

Interactive charts:

Plotly’s Python graphing library makes interactive, publication-quality graphs, with examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple axes, polar charts, and bubble charts.

(Picture from https://plotly.com/python/)
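
A minimal Plotly Express sketch, using the iris sample dataset bundled with Plotly:

import plotly.express as px

df = px.data.iris()                      # built-in sample dataset
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()                               # opens an interactive figure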

Bokeh is a Python library for creating interactive visualizations for modern web browsers. It helps you build beautiful graphics, ranging from simple plots to complex dashboards with streaming datasets. With Bokeh, you can create JavaScript-powered visualizations without writing any JavaScript yourself.

(Picture from https://docs.bokeh.org/en/latest/docs/gallery.html)
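
A minimal Bokeh sketch (the data points are made up); the result is an interactive HTML plot:

from bokeh.plotting import figure, output_file, show

output_file("bokeh_plot.html")           # write the plot to a standalone HTML file
p = figure(title="Simple line plot", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
show(p)                                  # open the plot in a browser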

folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the leaflet.js library. Manipulate your data in Python, then visualize it on a Leaflet map via folium.

The library has a number of built-in tilesets from OpenStreetMap, Mapbox, and Stamen, and supports custom tilesets with Mapbox or Cloudmade API keys. folium supports Image, Video, GeoJSON, and TopoJSON overlays.
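
A minimal folium sketch (the coordinates below are just an example location):

import folium

# Create a Leaflet map centered on a coordinate and add a marker
m = folium.Map(location=[48.8566, 2.3522], zoom_start=12)
folium.Marker([48.8566, 2.3522], popup="Example marker").add_to(m)
m.save("map.html")                       # open the saved file in a web browser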

3. Feature engineering

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities. We will discuss data preprocessing for feature engineering with scikit-learn:

(Picture from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
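
As a minimal sketch of this kind of preprocessing, the hypothetical toy DataFrame below combines a scaler for the numeric column with a one-hot encoder for the categorical column via ColumnTransformer:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical feature
X = pd.DataFrame({"age": [25, 32, 47, 51],
                  "city": ["Hanoi", "Hue", "Hanoi", "Saigon"]})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),   # scale numeric columns
    ("cat", OneHotEncoder(), ["city"]),   # one-hot encode categorical columns
])
X_transformed = preprocess.fit_transform(X)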

From the picture above, we can use many popular methods for feature engineering in scikit-learn: one-hot encoding, label encoding, scaling, power transforms… However, from my point of view, the recipes package (R) is easier to use, with more pre-built functions than scikit-learn. You should try the recipes package to learn more. The picture below shows the preprocessing steps included in the recipes package.

(Picture from: https://recipes.tidymodels.org/articles/Simple_Example.html)

Featuretools is an open source Python framework for automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.

Please see a demo of Featuretools at this link.

(Picture from https://www.featuretools.com/demos/)

Featuretools workflow:

(Picture from https://github.com/Featuretools/predict-customer-churn/blob/main/churn/5.%20Modeling.ipynb)
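
A minimal sketch of the Featuretools API, using the mock customer dataset shipped with the library (assuming Featuretools 1.x, where dfs takes target_dataframe_name; older releases use target_entity instead):

import featuretools as ft

# Small relational demo dataset (customers, sessions, transactions)
es = ft.demo.load_mock_customer(return_entityset=True)

# Deep Feature Synthesis: automatically generate a feature matrix per customer
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
feature_matrix.head()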

4. Build & Validate machine learning model

Manual machine learning setup & tuning:

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

(Picture from https://scikit-learn.org/stable/index.html)

Tutorial: https://scikit-learn.org/stable/auto_examples/index.html
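
A minimal sketch of building and validating a model with scikit-learn, using the built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
print(cross_val_score(model, X_train, y_train, cv=5))   # 5-fold cross-validation
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                      # accuracy on the held-out test set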

Scikit-learn wrapper for Keras: https://github.com/adriangb/scikeras

Scikit-learn wrapper for PyTorch: https://skorch.readthedocs.io/en/stable/

[Definition Link] H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

H2O’s core code is written in Java. Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines. The algorithms are implemented on top of H2O’s distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading. H2O’s REST API allows access to all the capabilities of H2O from an external program or script via JSON over HTTP. The Rest API is used by H2O’s web interface (Flow UI), R binding (H2O-R), and Python binding (H2O-Python).

The speed, quality, ease-of-use, and model-deployment for the various cutting edge Supervised and Unsupervised algorithms like Deep Learning, Tree Ensembles, and GLRM make H2O a highly sought after API for big data data science.

Please see the H2O user guide for Python and R at this link.

[Definition link] Keras is an API designed for human beings, not machines. Keras follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear & actionable error messages. It also has extensive documentation and developer guides.

Keras tutorial: https://keras.io/examples/ (audio, video, text, image, structured data, time series, RL, GAN…)

Keras tutorial for R user (R interface to Keras): https://keras.rstudio.com/articles/examples/index.html
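
A minimal sketch of defining and compiling a small Keras model (x_train and y_train are placeholders for your own data):

from tensorflow import keras
from tensorflow.keras import layers

# A small fully-connected network for 10-class classification on 20 input features
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=10, validation_split=0.2)  # with your own data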

[Wiki] TensorFlow is a free and open-source software library for machine learning. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. TensorFlow was developed by the Google Brain team for internal Google use. It was released under the Apache License 2.0 in 2015.

TensorFlow is one of the best frameworks for deep learning applications (NLP, computer vision, reinforcement learning…).

TensorFlow: please see this tutorial (https://www.tensorflow.org/tutorials) for TensorFlow application examples (images, text, audio, structured data, generative models, interpretability, reinforcement learning…).

Tensorflow for R users : https://tensorflow.rstudio.com/tutorials/

[Wiki] PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook’s AI Research lab (FAIR). It is free and open-source software released under the Modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface.

PyTorch tutorial : https://pytorch.org/tutorials/ (image, video, audio, text…)

rTorch tutorial (for R users) : https://f0nzie.github.io/rTorch/
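
A minimal sketch of a PyTorch model and a single training step on random data:

import torch
import torch.nn as nn

# A small feed-forward network for binary classification on 20 input features
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 20)            # a random mini-batch of 32 samples
y = torch.randint(0, 2, (32,))     # random binary labels
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()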

Dask-ML provides scalable machine learning in Python using Dask alongside popular machine learning libraries like Scikit-Learn, XGBoost, and others.

Tutorial : https://ml.dask.org/

[Link] MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives

Tutorial: https://spark.apache.org/docs/1.2.1/mllib-guide.html (available for Scala, Python, Java)
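
A minimal sketch with MLlib's DataFrame-based API (pyspark.ml); the tiny DataFrame below is hypothetical:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical data: two numeric features and a binary label
df = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.1, -1.0, 1.0), (0.1, 1.2, 0.0)],
    ["f1", "f2", "label"])
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()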

Automated machine learning setup & tuning:

[Link] PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment.

Please see the PyCaret tutorials at this link (anomaly detection, binary classification, clustering, multi-class classification, NLP, regression…).
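
A minimal sketch of the PyCaret classification workflow, using the diabetes sample dataset shipped with PyCaret (the exact setup arguments vary between PyCaret versions):

from pycaret.classification import compare_models, setup
from pycaret.datasets import get_data

data = get_data("diabetes")                            # built-in sample dataset
exp = setup(data, target="Class variable", session_id=123)
best_model = compare_models()                          # train and rank many models automatically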

[Link] auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator:

import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True), random_state=1)
cls = autosklearn.classification.AutoSklearnClassifier()
cls.fit(X_train, y_train)
predictions = cls.predict(X_test)

Tutorial: link

[Link] EvalML is an AutoML library that builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

Combined with Featuretools and Compose, EvalML can be used to create end-to-end supervised machine learning solutions.

Tutorial: Link

[Link] Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Tutorial: Link
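
A minimal sketch of TPOT on the scikit-learn digits dataset (small generations and population_size so the search finishes quickly):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")   # export the winning pipeline as Python code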

[Link] AutoKeras: An AutoML system based on Keras. It is developed by DATA Lab at Texas A&M University. The goal of AutoKeras is to make machine learning accessible to everyone.

import autokeras as ak
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
clf = ak.ImageClassifier(max_trials=1)   # search over at most 1 model for a quick demo
clf.fit(x_train, y_train)
results = clf.predict(x_test)

Tutorial: Link

[Link] H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles — one based on all previously trained models, another one on the best model of each family — will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.

H2O offers a number of model explainability methods that apply to AutoML objects (groups of models), as well as individual models (e.g. leader model). Explanations can be generated automatically with a single function call, providing a simple interface to exploring and explaining the AutoML models.

Tutorial in Python and R: Link

Example from: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#code-examples

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)

[Link] AutoGluon enables easy-to-use and easy-to-extend AutoML with a focus on automated stack ensembling, deep learning, and real-world applications spanning text, image, and tabular data. Intended for both ML beginners and experts

Tutorial : Link
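
A minimal sketch of AutoGluon's tabular API, using the public sample dataset referenced in the AutoGluon tutorials:

from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")
test_data = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")

predictor = TabularPredictor(label="class").fit(train_data)   # automatic models + ensembling
print(predictor.evaluate(test_data))
print(predictor.leaderboard(test_data))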

[Link] Gluon Time Series (GluonTS) is the Gluon toolkit for probabilistic time series modeling, focusing on deep learning-based models.

GluonTS provides utilities for loading and iterating over time series datasets, state of the art models ready to be trained, and building blocks to define your own models and quickly experiment with different solutions.

Tutorial: Link

(Picture from this link : https://github.com/awslabs/gluon-ts/)

5. Explain model

  • Dalex package (please refer to the very good book about this wonderful package at this link)

Explaining a machine learning model is a very important step because it helps us understand why our model predicts a given result. With the Dalex package, the machine learning model is not a black box anymore: we can find which important features affect our predictions and make recommendations or explanations to our stakeholders / customers.

According to the Dalex package, we have two explanation levels: instance level (one sample) and dataset level (all samples). The instance level includes break-down plots, Shapley additive explanations, LIME, ceteris-paribus profiles, ceteris-paribus oscillations, and local-diagnostics plots. The dataset level includes model performance, variable importance, partial-dependence profiles, local-dependence and accumulated-local profiles, and residual-diagnostics plots.

Overview of the Dalex package (picture from https://ema.drwhy.ai/introduction.html)

Instance level explanation (picture link: https://ema.drwhy.ai/summaryInstanceLevel.html)

Dataset level explanation (picture link: https://ema.drwhy.ai/summaryModelLevel.html)
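
A minimal sketch of the Python Dalex API: wrap any fitted model in an Explainer, then ask for dataset-level or instance-level explanations (the breast cancer dataset and random forest below are just placeholders):

import dalex as dx
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
model = RandomForestClassifier(random_state=42).fit(X, y)

explainer = dx.Explainer(model, X, y)
explainer.model_parts().plot()                 # dataset level: variable importance
explainer.predict_parts(X.iloc[[0]]).plot()    # instance level: break-down plot for one sample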

6. Communicate results

(Picture from this link: https://jupyter.org/)

[Link] JupyterLab is a web-based interactive development environment for Jupyter notebooks, code, and data. JupyterLab is flexible: configure and arrange the user interface to support a wide range of workflows in data science, scientific computing, and machine learning. JupyterLab is extensible and modular: write plugins that add new components and integrate with existing ones.

(Picture from this link: https://jupyter.org/)

[Link] Jupyter NBViewer is the web application behind The Jupyter Notebook Viewer, which is graciously hosted by OVHcloud.

Please note: a simple notebook can be shared via a GitHub link to your code; however, for large notebooks and complex JavaScript libraries, nbviewer is a better choice.

[Link] Voilà turns Jupyter notebooks into standalone web applications. Unlike the usual HTML-converted notebooks, each user connecting to the Voilà tornado application gets a dedicated Jupyter kernel which can execute the callbacks to changes in Jupyter interactive widgets.

See the Voilà gallery at this link: https://voila-gallery.org/

(Picture from https://voila-gallery.org/)

Voila tutorial: Link

7. Deployment (web app)

Complexity increases from Streamlit → Dash → Flask

Streamlit turns data scripts into shareable web apps in minutes.
All in Python. All for free. No front‑end experience required.

[Link] Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. In just a few minutes you can build and deploy powerful data apps

My point of view: although with Streamlit we can build a web app very fast with only the Python language (for both front end and back end), Shiny (R) is still superior, with many strong supportive features for web app development. The main drawback of Shiny is that you have to learn R :-)

Tutorial: Link

Streamlit Gallery: Link

(Picture from https://streamlit.io/gallery)
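
A minimal sketch of a Streamlit app (saved as a hypothetical app.py and launched with streamlit run app.py):

# app.py  (run with: streamlit run app.py)
import numpy as np
import pandas as pd
import streamlit as st

st.title("A minimal Streamlit app")
n = st.slider("Number of points", 10, 100, 50)         # interactive widget
df = pd.DataFrame(np.random.randn(n, 2), columns=["x", "y"])
st.line_chart(df)                                      # chart updates when the slider changes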

[Link] Dash apps give a point-&-click interface to models written in Python, R, and Julia — vastly expanding the notion of what’s possible in a traditional “dashboard.” With Dash apps, data scientists and engineers put complex Python analytics in the hands of business decision makers and operators.

Python has taken over the world, and Dash Enterprise is the leading vehicle for delivering Python analytics to business users. Traditional BI dashboards no longer cut it in today’s AI and ML driven world. Production-grade, low-code Python apps are needed for the complex analytics of emerging industries such as autonomous vehicles, renewable energy, quantum computing, novel therapeutics, and more.

My point of view: Dash is a powerful tool for web app development (mostly in pure Python), but it is more complex than Streamlit. I suggest using Streamlit for simple apps and Dash for complex apps.

Dash tutorial: https://dash.plotly.com/

Dash gallery: https://dash-gallery.plotly.host/Portal/

(Picture from https://dash-gallery.plotly.host/Portal/)
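
A minimal sketch of a Dash app (assuming Dash 2.x, where Dash, dcc, and html are imported directly from the dash package):

# app.py  (run with: python app.py)
from dash import Dash, dcc, html
import plotly.express as px

app = Dash(__name__)
fig = px.scatter(px.data.iris(), x="sepal_width", y="sepal_length", color="species")
app.layout = html.Div([
    html.H1("A minimal Dash app"),
    dcc.Graph(figure=fig),
])

if __name__ == "__main__":
    app.run_server(debug=True)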

[Link] Flask is a lightweight WSGI web application framework. It is designed to make getting started quick and easy, with the ability to scale up to complex applications. It began as a simple wrapper around Werkzeug and Jinja and has become one of the most popular Python web application frameworks.

Flask tutorial: https://flask.palletsprojects.com/en/1.1.x/tutorial/
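
A minimal sketch of a Flask app exposing a hypothetical /predict endpoint:

# app.py  (run with: python app.py)
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/predict")
def predict():
    # In a real app, load a trained model here and return its prediction
    return jsonify({"prediction": 1})

if __name__ == "__main__":
    app.run(debug=True)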
