Automated Machine Learning

AutoML is the automation of ML algorithm selection and of the structured design process for a defined model. It provides predesigned, systematically structured data analysis tools that help banks and financial institutions apply ML best practices to obtain accurate predictions at low cost and in a short time. With AutoML, a bank or financial institution can derive the same results in less time and at a lower cost. Because the datasets of different algorithm applications have already been used, tested, coded, and recorded by various data scientists, AutoML offers a predesigned data analysis structure that helps apply the right algorithm with well-tuned settings, reducing the time data scientists need to deliver accurate results.

Executive Summary

Vector ML Platform uses ML algorithms to predict prepayment and credit default behavior. With traditional ML models, datasets must be tested against various algorithms with different tuning settings to produce accurate results, which can take a long time and require large investments. AutoML is a fundamental shift in how businesses of all sizes use, develop, and implement ML algorithms that drive growth. Using predefined systems, this work can be completed automatically on the platform.

Automated Machine Learning Benefits

It is estimated that data scientists spend 60% of their time cleaning and organizing datasets and 19% collecting datasets. AutoML reduces the time spent on this routine work, leaving more time for solving critical problems. In building an ML model, the data scientist follows a traditional sequence of steps: collecting raw data, analyzing and filtering it, selecting an algorithm, training and tuning it, testing the results, and repeating the process until the best algorithm is found. Because no single algorithm is best for every problem, the data science team has to identify the right algorithm using the available data.

AutoML Process

1. Preprocessing

2. Feature Engineering

3. Feature Selection

4. Model Selection and Hyperparameter Tuning

Figure: Traditional Machine Learning Workflow

Figure: AutoML Workflow

1. Preprocessing

Standardization

Some algorithms, such as linear regression, logistic regression and linear discriminant analysis, assume that the input data follows a Gaussian distribution, and many algorithms are biased towards features that have a wider range. Standardization is a technique for preventing features with wider ranges from dominating the distance metric or objective used by such algorithms. Several standardization techniques can be used:

MinMax Scaler
Standard Scaler
Power Transformer Scaler
Unit Vector Scaler

The MinMax Scaler has been implemented as a standardization technique in the platform.
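
As a rough illustration of what min-max scaling does, the sketch below rescales each feature linearly so that its smallest value maps to 0 and its largest to 1. The function name `min_max_scale` and the loan figures are hypothetical and are not taken from the platform. Libraries such as scikit-learn provide the same transformation as `MinMaxScaler`.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale a list of numbers linearly into feature_range."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant feature has no range to rescale; map it to the lower bound.
        return [lo for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]


# Hypothetical loan features with very different ranges.
loan_balance = [50_000, 120_000, 310_000, 450_000]
interest_rate = [2.5, 3.0, 4.5, 6.5]

scaled_balance = min_max_scale(loan_balance)
scaled_rate = min_max_scale(interest_rate)
```

After scaling, both features lie in the same [0, 1] interval, so the balance no longer outweighs the interest rate simply because its raw values are larger.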

Missing Values Handling

In real-world data, some values are almost always missing, for various reasons. Handling missing values is a crucial step in the ML pipeline, since choosing the right strategy can help the ML model perform significantly better. Common strategies for handling missing values are:

Dropping the entire row if it has a missing value
Replacing with mean/median/mode
Replacing with the most frequent value for the categorical variables
Predicting the missing values using some learning algorithm such as linear regression
Using algorithms that handle missing values
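
The first three strategies above can be sketched in a few lines of standard-library Python. The records, column names and helper functions below are hypothetical and serve only to illustrate the idea. The model-based strategies are not shown.

```python
from statistics import median, mode

# Hypothetical loan records; None marks a missing value.
records = [
    {"income": 52_000, "term": "30Y"},
    {"income": None, "term": "15Y"},
    {"income": 61_000, "term": None},
    {"income": 48_000, "term": "30Y"},
]


def drop_incomplete(rows):
    """Drop every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute(rows, column, strategy):
    """Fill missing entries in `column` with a statistic (for example mean,
    median or mode) computed from the observed entries."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = strategy(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


complete_rows = drop_incomplete(records)
filled = impute(records, "income", median)  # numerical column: median
filled = impute(filled, "term", mode)       # categorical column: most frequent value
```

Dropping rows is the simplest option but discards data, while imputation keeps every record at the cost of introducing estimated values.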

Categorical Variables Encoding

Label encoding, which maps each distinct category to an integer code, is used to encode the categorical variables.
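
A minimal sketch of label encoding follows, assuming codes are assigned in sorted order of the category names (the convention that scikit-learn's `LabelEncoder` also follows). The helper `fit_label_encoder` and the sample column are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


# Hypothetical categorical column.
property_type = ["condo", "single_family", "condo", "townhouse"]

mapping = fit_label_encoder(property_type)
encoded = [mapping[v] for v in property_type]
```

The integer codes carry an implicit ordering that the categories themselves may not have. Tree-based models are usually unaffected by this, while linear models may be.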

2. Feature Engineering

Feature engineering is often the most important step for improving ML algorithm performance. It is the process of using domain knowledge to create features that increase the predictive power of the ML algorithm.

Featuretools

Featuretools is a framework for automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for ML. Featuretools is based on a technique known as Deep Feature Synthesis.

Deep Feature Synthesis (DFS)

DFS is an algorithm that automatically generates features for relational datasets. DFS performs feature engineering by taking the relationships among multiple tables as input and then applying mathematical operations across those relationships to generate meaningful features.
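
To make the idea concrete, the hand-rolled sketch below applies a few aggregation operations across one parent-child relationship. It does not use the Featuretools API. The tables, the helper `aggregate_children` and the feature names are hypothetical, and the sketch covers a single level of depth, whereas DFS can also stack operations across several related tables.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical parent table (one row per borrower) and child table (one row per payment).
borrowers = [{"borrower_id": 1}, {"borrower_id": 2}]
payments = [
    {"borrower_id": 1, "amount": 1000.0},
    {"borrower_id": 1, "amount": 1500.0},
    {"borrower_id": 2, "amount": 700.0},
]

# Aggregation operations applied across the parent-child relationship.
primitives = {"SUM": sum, "MEAN": mean, "MAX": max, "COUNT": len}


def aggregate_children(parents, children, key, column, child_name):
    """Build one feature per primitive by aggregating `column` over the
    child rows that share `key` with each parent row."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[column])
    matrix = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        features = dict(parent)
        for name, func in primitives.items():
            # Parents without child rows get None rather than an invented value.
            features[f"{name}({child_name}.{column})"] = func(values) if values else None
        matrix.append(features)
    return matrix


feature_matrix = aggregate_children(borrowers, payments, "borrower_id", "amount", "payments")
```

Each row of `feature_matrix` describes one borrower with features derived from that borrower's payments, which is the kind of feature matrix a model can then be trained on.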

3. Feature Selection

Feature selection is the process of selecting the most relevant features. ML practitioners generally aim for the simplest model with the fewest features, because ML algorithms tend to generalize better when they rely on fewer, more relevant features. Common techniques for feature selection are:

Removing features with low variance, since low variance implies less information;
Selecting the most important features with tree-based algorithms such as decision trees and random forests;
Removing features that are very highly correlated with other features.
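
The variance and correlation filters can be sketched with the standard library alone. The thresholds, column names and the helper `select_features` below are illustrative assumptions. The tree-based technique is not shown because it requires a trained model.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def select_features(columns, min_variance=1e-3, max_correlation=0.95):
    """Drop low-variance columns, then drop any column that is highly
    correlated with a column that has already been kept."""
    kept = [name for name, values in columns.items() if pvariance(values) > min_variance]
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[other])) < max_correlation
               for other in selected):
            selected.append(name)
    return selected


# Hypothetical feature columns.
columns = {
    "loan_age": [1, 2, 3, 4, 5],
    "loan_age_months": [12, 24, 36, 48, 60],  # perfectly correlated with loan_age
    "is_active": [1, 1, 1, 1, 1],             # zero variance
    "rate": [3.1, 2.9, 4.0, 3.5, 3.2],
}
selected = select_features(columns)
```

Because variance depends on the scale of a feature, a variance threshold is easier to choose once the features have been brought to a common range.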

4. Model Selection and Hyperparameter Tuning

There are multiple ML algorithms that can be used for the final prediction, but there is no single algorithm that will have the best performance on every problem. AutoML is the automated process of training and selecting the best algorithm and its hyperparameters. Auto-Sklearn has been used as the automated model selection tool.
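
As a simplified stand-in for this automated search, the sketch below uses plain scikit-learn rather than the Auto-Sklearn API. It tunes two candidate algorithms with a cross-validated grid search and keeps the one with the best validation score. The synthetic data, the candidate models and the grid values are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate algorithms, each with a small hyperparameter grid (illustrative values).
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

# Tune every candidate with 3-fold cross-validation on the training split.
results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="roc_auc")
    search.fit(X_train, y_train)
    results[name] = search

# Keep the algorithm whose best configuration scored highest in cross-validation.
best_name = max(results, key=lambda n: results[n].best_score_)
best_model = results[best_name].best_estimator_
test_accuracy = best_model.score(X_test, y_test)
```

An exhaustive grid like this grows quickly with the number of algorithms and hyperparameters, which is the cost that an automated tool tries to reduce with a smarter search.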

The Auto-Sklearn architecture consists of three components: meta-learning, Bayesian optimization and ensemble selection. Meta-learning in ML refers to ML algorithms that learn from the output of other ML algorithms. Its goal here is to reduce the search space by learning from models that already performed well on similar datasets. Next, Bayesian optimization is used to find the best-performing pipeline. Finally, an ensemble model is created from the best-performing models found in the Bayesian optimization step.
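
The final ensemble step can be illustrated with a greedy ensemble selection sketch: starting from an empty ensemble, the model that most improves the averaged validation prediction is added in each round, and the same model may be added more than once. This is a simplified illustration under assumed inputs, not Auto-Sklearn's own code. The model names, predictions and number of rounds are hypothetical.

```python
def ensemble_selection(predictions, targets, rounds=5):
    """Greedily pick models (with replacement) whose averaged validation
    predictions minimise the mean squared error against `targets`."""
    def mse(pred):
        return sum((p - t) ** 2 for p, t in zip(pred, targets)) / len(targets)

    chosen = []                         # selected model names, repeats allowed
    running_sum = [0.0] * len(targets)  # summed predictions of the chosen models
    for _ in range(rounds):
        best_name, best_error = None, float("inf")
        size = len(chosen) + 1
        for name, pred in predictions.items():
            # Error of the ensemble if this model were added next.
            candidate = [(s + p) / size for s, p in zip(running_sum, pred)]
            error = mse(candidate)
            if error < best_error:
                best_name, best_error = name, error
        chosen.append(best_name)
        running_sum = [s + p for s, p in zip(running_sum, predictions[best_name])]
    # A model's weight is the share of rounds in which it was picked.
    return {name: chosen.count(name) / len(chosen) for name in predictions}


# Hypothetical validation-set probabilities from three already-trained models.
targets = [1, 0, 1, 1, 0]
predictions = {
    "model_a": [0.9, 0.4, 0.6, 0.7, 0.3],
    "model_b": [0.6, 0.1, 0.9, 0.8, 0.4],
    "model_c": [0.5, 0.5, 0.5, 0.5, 0.5],
}
weights = ensemble_selection(predictions, targets)
```

The resulting weights sum to one, and a model that never improves the ensemble, such as the uninformative `model_c` here, receives a weight of zero.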