Hyperparameters
In theory, a machine learning or deep learning algorithm learns from data and adjusts its parameters on its own, much like a person reading a book. In practice, though, just as people differ from one another, so do models, and hyperparameters control what a model looks like. From a coding point of view, hyperparameters define the structure of the model and the way it learns from data. For example, the depth of a decision tree and the minimum number of samples per leaf are hyperparameters, since they shape how the tree grows. A neural network can be designed with a different number of layers, and the number of neurons in each hidden layer is also flexible. Likewise, the learning rate and the choice of optimizer offer many options. In most scenarios, the choice of these hyperparameters significantly affects performance, so tuning them is an important part of model training.
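As a minimal sketch with scikit-learn (the synthetic dataset and the particular values are just placeholders), hyperparameters are the arguments fixed before fitting, while the model's own parameters are learned from the data:
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# a small synthetic regression problem, just for illustration
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# max_depth and min_samples_leaf are hyperparameters: chosen by us before
# training, they constrain how the tree is allowed to grow
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10)
tree.fit(X, y)  # the split features and thresholds are learned parameters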
Automatic Optimization
There are many ways to optimize hyperparameters, from manual adjustment by hand to automatic methods that we usually prefer. "Automatic optimization" essentially does the same thing a human would: try different settings on the model and check for improvement. Common approaches include grid search, random search, Bayesian optimization, and genetic algorithms. Grid search and random search are the most basic: grid search exhaustively tries every combination of settings, while random search samples combinations at random. The other methods are more efficient, adjusting the hyperparameters based on the observed loss, as demonstrated later in this post.
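As a minimal sketch of the most basic approach, here is a grid search with scikit-learn's GridSearchCV (the grid values are hypothetical): every combination in the grid is trained and scored by cross-validation, and the best one is kept.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# 3 depths x 3 leaf sizes = 9 candidates, each evaluated with 5-fold CV
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [3, 5, 10], 'min_samples_leaf': [1, 5, 10]},
    scoring='neg_mean_absolute_error',
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)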
Code Example
Here is an example where we use optuna to optimize the hyperparameters. Assume that we have data like:
Reactant1 | Reactant2 | Product | Additive | Solvent | Yield |
---|---|---|---|---|---|
c1ccc2c(c1)Nc1ccccc1O2 | Brc1ccccc1I | Brc1ccccc1N1c2ccccc2Oc2ccccc21 | CC(C)(C)[O-].CC(C)(C)[PH+](C(C)(C)C)C(C)(C)C.F[B-](F)(F)F.F[B-](F)(F)F.O=C(C=Cc1ccccc1)C=Cc1ccccc1.O… | Cc1ccccc1 | 0.78 |
c1ccc2c(c1)Nc1ccccc1O2 | Brc1ccccc1I | Brc1ccccc1N1c2ccccc2Oc2ccccc21 | C1COCCOCCOCCOCCOCCO1.O=C([O-])[O-].[Cu+].[I-].[K+].[K+] | Clc1ccccc1Cl | 0.9 |
… | … | … | … | … | … |
Here we want to predict the yield of the product based on the reaction equation and conditions.
Feature Engineering
Since the structure of a molecule is given as a string (a SMILES string), which a model cannot process directly, we need to convert it into a vector of numbers; this can be done in many ways. Here we use rdkit to convert each molecule to a Morgan fingerprint.
import pandas as pd
import numpy as np
import warnings
from tqdm import tqdm  # progress bar
from rdkit import RDLogger, Chem
from rdkit.Chem import rdMolDescriptors  # convert molecules to numbers
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
import optuna  # hyperparameter optimizer

RDLogger.DisableLog('rdApp.*')  # silence RDKit parsing warnings
warnings.filterwarnings('ignore')
def mfgen(mol, nBits=2048, radius=2):
    '''Convert an RDKit molecule to a Morgan fingerprint bit vector.'''
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
        mol,
        radius=radius,
        nBits=nBits,
    )
    # '0'/'1' characters -> integer array
    return np.array(list(fp.ToBitString()), dtype=int)

def vec_cpd_lst(smi_lst):
    '''Vectorize a list of SMILES strings, computing each unique SMILES only once.'''
    smi_set = list(set(smi_lst))
    smi_vec_map = {}
    for smi in tqdm(smi_set):
        mol = Chem.MolFromSmiles(smi)
        smi_vec_map[smi] = mfgen(mol)
    smi_vec_map[''] = np.zeros(2048)  # empty entries map to an all-zero vector
    vec_lst = [smi_vec_map[smi] for smi in smi_lst]
    return np.array(vec_lst)
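As a quick sanity check (ethanol and benzene as hypothetical inputs), each SMILES string becomes one 2048-bit row:
demo = vec_cpd_lst(['CCO', 'c1ccccc1'])  # ethanol, benzene
print(demo.shape)  # (2, 2048)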
def load_train():
    DATA = pd.read_csv('train_data.csv')
    DATA_rct1_smi = DATA['Reactant1'].to_list()
    DATA_rct2_smi = DATA['Reactant2'].to_list()
    DATA_add_smi = DATA['Additive'].to_list()
    DATA_sol_smi = DATA['Solvent'].to_list()
    # featurize each component of the reaction
    DATA_rct1_fp = vec_cpd_lst(DATA_rct1_smi)
    DATA_rct2_fp = vec_cpd_lst(DATA_rct2_smi)
    DATA_add_fp = vec_cpd_lst(DATA_add_smi)
    DATA_sol_fp = vec_cpd_lst(DATA_sol_smi)
    # concatenate the four fingerprints into one feature vector per reaction
    DATA_X = np.concatenate(
        [
            DATA_rct1_fp,
            DATA_rct2_fp,
            DATA_add_fp,
            DATA_sol_fp,
        ],
        axis=1,
    )
    DATA_Y = DATA['Yield'].to_numpy()
    return DATA_X, DATA_Y
DATA_X, DATA_Y = load_train()
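The objective function below reloads the features from disk on every trial, so we cache the arrays once as .npy files:
# cache the featurized data so objective() can load it from disk
np.save('DATA_X.npy', DATA_X)
np.save('DATA_Y.npy', DATA_Y)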
Hyperparameter Optimization
Here we use optuna to find the best settings for lightgbm.
def objective(trial):
    param = {
        # fixed settings
        'objective': 'regression',
        'metric': 'mae',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'device': 'gpu',
        'seed': 2024,
        # search ranges for the tunable hyperparameters
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 2 ** 3, 2 ** 10),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.2, log=True),
    }
    DATA_X, DATA_Y = np.load('DATA_X.npy'), np.load('DATA_Y.npy')
    X_train, X_val, Y_train, Y_val = train_test_split(
        DATA_X, DATA_Y,
        shuffle=True,
        test_size=0.2,
        random_state=2024,
    )
    train_matrix = lgb.Dataset(X_train, label=Y_train)
    valid_matrix = lgb.Dataset(X_val, label=Y_val)
    model = lgb.train(
        param,
        train_matrix,
        1000,
        valid_sets=[train_matrix, valid_matrix],
        callbacks=[lgb.early_stopping(100)],
    )
    preds = model.predict(X_val)
    mae = mean_absolute_error(Y_val, preds)
    return mae
study = optuna.create_study(
    study_name='LightGBM_Param_Optim',
    direction='minimize',
)
study.optimize(objective, n_trials=5, n_jobs=1, show_progress_bar=True)
In param, we define the settings for lightgbm. Each trial.suggest_* call tells optuna to optimize the corresponding parameter within the given range. For example,
trial.suggest_int('num_leaves', 2 ** 3, 2 ** 10)
means that the parameter is searched between 8 and 1024 and can only take integer values.
This code will run 5 trials (n_trials=5) and report the hyperparameter combination with the lowest mae (direction='minimize').
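Besides the best trial, the full search history can be inspected; optuna can export every trial as a pandas DataFrame:
# one row per trial: trial number, objective value (mae), parameters, and state
history = study.trials_dataframe()
print(history[['number', 'value', 'state']].head())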
Then the best settings can be printed by
print('Best trial:', study.best_trial.params)
And can be saved to a JSON file using
import json

best_params = study.best_trial.params
filename = 'best_trial_params.json'
with open(filename, 'w') as f:
    json.dump(best_params, f, indent=4)
print(f"Best trial parameters saved to {filename}")
Model Training
Once we have the best parameters, the model can be built with the following code. (The hyperparameter values below are hard-coded as an example; in practice, substitute the values loaded from the JSON file above.)
def train(DATA_X, DATA_Y):
    X_train, X_val, Y_train, Y_val = train_test_split(
        DATA_X, DATA_Y,
        shuffle=True,
        test_size=0.2,
        random_state=2024,
    )
    train_matrix = lgb.Dataset(X_train, label=Y_train)
    valid_matrix = lgb.Dataset(X_val, label=Y_val)
    lgb_params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': 'mae',
        'min_child_weight': 6,
        'num_leaves': 1000,
        'lambda_l2': 0.5,
        'feature_fraction': 0.7,
        'bagging_fraction': 0.7,
        'bagging_freq': 5,
        'learning_rate': 0.005,
        'seed': 2024,
        'nthread': 16,
        'verbose': -1,
        'device': 'gpu',
    }
    model = lgb.train(
        lgb_params,
        train_matrix,
        10000,
        valid_sets=[train_matrix, valid_matrix],
        callbacks=[lgb.log_evaluation(100), lgb.early_stopping(1000)],
    )
    val_pred = model.predict(
        X_val,
        num_iteration=model.best_iteration,
    )
    score = mean_squared_error(Y_val, val_pred)
    print('Validation Set Score:', score)
    model.save_model('lgbr_best.txt', num_iteration=model.best_iteration)
    return model
model = train(DATA_X, DATA_Y)
Prediction
And predict the yield for each reaction by
def load_test():
    DATA = pd.read_csv('test_data.csv')
    DATA_rct1_smi = DATA['Reactant1'].to_list()
    DATA_rct2_smi = DATA['Reactant2'].to_list()
    DATA_add_smi = DATA['Additive'].to_list()
    DATA_sol_smi = DATA['Solvent'].to_list()
    DATA_rct1_fp = vec_cpd_lst(DATA_rct1_smi)
    DATA_rct2_fp = vec_cpd_lst(DATA_rct2_smi)
    DATA_add_fp = vec_cpd_lst(DATA_add_smi)
    DATA_sol_fp = vec_cpd_lst(DATA_sol_smi)
    # same feature layout as in training: four fingerprints side by side
    DATA_X = np.concatenate([DATA_rct1_fp, DATA_rct2_fp, DATA_add_fp, DATA_sol_fp], axis=1)
    return DATA_X
DATA_X = load_test()
model = lgb.Booster(model_file='lgbr_best.txt')
prediction = model.predict(DATA_X, num_iteration=model.best_iteration)
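Finally, the predicted yields can be written out, for example to a CSV file (the file name and column name here are just placeholders):
# one predicted yield per test reaction; adjust to whatever output format you need
pd.DataFrame({'Yield_pred': prediction}).to_csv('predictions.csv', index=False)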