YAML basics
Source
- https://www.udemy.com/course/deployment-of-machine-learning-models/
- The GitHub repo of the Udemy course gives access to all files and scripts
- https://hitchdev.com/strictyaml/why-not/turing-complete-code/
- https://stackabuse.com/reading-and-writing-yaml-to-a-file-in-python/
Config with YAML
It is a good idea to move all global constants out of the feature engineering scripts and into a YAML config file, instead of listing them in the .py files. The idea is to use the least powerful language that can do the job: a config file that cannot execute code leaves less room for bugs, and other languages can read it easily. It also gives you a single place where all your constants are saved, which is convenient.
The basics of a YAML file
# assign a value
package_name: regression_model

# create a dictionary
variables_to_rename:
  1stFlrSF: FirstFlrSF
  2ndFlrSF: SecondFlrSF
  3SsnPorch: ThreeSsnPortch

# create a list
features:
  - MSSubClass
  - MSZoning
  - LotFrontage
  - LotShape
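To see how these three constructs arrive in Python, the snippet below parses a shortened version of the same YAML from a string. It is a minimal sketch that assumes PyYAML is installed (pip install pyyaml) and uses yaml.safe_load; the variable names yaml_text and config are only illustrative.

```python
import yaml  # PyYAML, assumed installed: pip install pyyaml

# A shortened version of the YAML above, held in a string for illustration
yaml_text = """
package_name: regression_model
variables_to_rename:
  1stFlrSF: FirstFlrSF
  2ndFlrSF: SecondFlrSF
features:
  - MSSubClass
  - MSZoning
"""

config = yaml.safe_load(yaml_text)

# a plain "key: value" pair becomes a string entry of the top-level dict
print(type(config["package_name"]).__name__, config["package_name"])
# indented key/value pairs become a nested dict
print(config["variables_to_rename"]["1stFlrSF"])
# dash-prefixed items become a list
print(config["features"])
```

Note that the indentation and the plain hyphen in front of each list item are part of the syntax: without them the file does not parse into this structure.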
Reading and saving data in a YAML file
# simple test to read yaml files
# https://stackabuse.com/reading-and-writing-yaml-to-a-file-in-python/
# pip install pyyaml
import yaml

# to read yaml data
with open(r'C:\Users\XXX\TEST_yml\config.yml') as file:
    # full_load uses the FullLoader, which converts the YAML
    # content into Python objects (here, a dictionary)
    documents = yaml.full_load(file)

for item, doc in documents.items():
    print(item, ":", doc)

print(documents['features'])

# to save data in yaml format
# note that opening the file with 'w' erases any data already in it
dict_file = [{'sports': ['soccer', 'football', 'basketball', 'cricket', 'hockey', 'table tennis']},
             {'countries': ['Pakistan', 'USA', 'India', 'China', 'Germany', 'France', 'Spain']}]

with open(r'C:\Users\XXX\TEST_yml\config.yml', 'w') as file:
    # dump writes straight to the file and returns None, so there is nothing to assign
    yaml.dump(dict_file, file)
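The read and write steps can also be checked together as a round trip. The sketch below is one possible way to do it, assuming PyYAML is installed; it writes to a temporary directory (so it does not depend on the Windows path used above) and uses safe_dump and safe_load, which are a swapped-in choice rather than the calls from the original example. The file name and the data are illustrative.

```python
import tempfile
from pathlib import Path

import yaml  # PyYAML, assumed installed

# Illustrative data, similar in shape to the config discussed in this post
data = {
    "features": ["MSSubClass", "MSZoning"],
    "test_size": 0.1,
    "random_state": 0,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.yml"  # hypothetical file name

    # "w" truncates the file, so any earlier content is replaced
    with open(config_path, "w") as file:
        yaml.safe_dump(data, file)

    # read the file back into Python objects
    with open(config_path) as file:
        loaded = yaml.safe_load(file)

# the loaded dictionary should match what was written
print(loaded == data)
```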
What model-related information to store in a YAML file
The Udemy “Deployment of Machine Learning Models” course gives a good example. Its config.yml file stores information related to:
- Package name
- File names (train.csv, test.csv)
- Any feature lists needed for a feature engineering technique (numerical_vars_with_na, temporal_vars)
- Mappings for transforming features
- Train/test split (test_size: 0.1)
- Random seed value
- ML constants (alpha, etc.)
Below is the course's config.yml file, located in the regression_model folder. It covers many different types of constants.
# Package Overview
package_name: regression_model

# Data Files
training_data_file: train.csv
test_data_file: test.csv

# Variables
# The variable we are attempting to predict (sale price)
target: SalePrice

pipeline_name: regression_model
pipeline_save_file: regression_model_output_v

# Will cause syntax errors since they begin with numbers
variables_to_rename:
  1stFlrSF: FirstFlrSF
  2ndFlrSF: SecondFlrSF
  3SsnPorch: ThreeSsnPortch

features:
  - MSSubClass
  - MSZoning
  # - […]

# set train/test split
test_size: 0.1

# to set the random seed
random_state: 0

alpha: 0.001

# categorical variables with NA in train set
categorical_vars_with_na_frequent:
  - BsmtQual
  - BsmtExposure
  - BsmtFinType1
  - GarageFinish

categorical_vars_with_na_missing:
  - FireplaceQu

numerical_vars_with_na:
  - LotFrontage

temporal_vars:
  - YearRemodAdd

ref_var: YrSold

# variables to log transform
numericals_log_vars:
  - LotFrontage
  - FirstFlrSF
  - GrLivArea

binarize_vars:
  - ScreenPorch

# variables to map
qual_vars:
  - ExterQual
  - BsmtQual
  - HeatingQC
  - KitchenQual
  - FireplaceQu

exposure_vars:
  - BsmtExposure

finish_vars:
  - BsmtFinType1

garage_vars:
  - GarageFinish

categorical_vars:
  - MSSubClass
  - MSZoning
  # - […]

# variable mappings
qual_mappings:
  Po: 1
  Fa: 2
  TA: 3
  Gd: 4
  Ex: 5
  Missing: 0
  NA: 0

exposure_mappings:
  No: 1
  Mn: 2
  Av: 3
  Gd: 4

finish_mappings:
  Missing: 0
  NA: 0
  Unf: 1
  LwQ: 2
  Rec: 3
  BLQ: 4
  ALQ: 5
  GLQ: 6

garage_mappings:
  Missing: 0
  NA: 0
  Unf: 1
  RFn: 2
  Fin: 3
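One caveat applies to mappings such as exposure_mappings if the file is read with PyYAML instead of strictyaml. PyYAML follows YAML 1.1, where unquoted yes/no/on/off are read as booleans, so the key No becomes False. The sketch below, which assumes PyYAML is installed, shows the effect on a shortened copy of the mapping and one possible workaround (quoting the key).

```python
import yaml  # PyYAML, assumed installed

# Unquoted "No" is resolved as a YAML 1.1 boolean by PyYAML
unquoted = yaml.safe_load("""
exposure_mappings:
  No: 1
  Mn: 2
""")

# Quoting the key forces it to stay a string
quoted = yaml.safe_load("""
exposure_mappings:
  "No": 1
  Mn: 2
""")

print(unquoted["exposure_mappings"])  # the first key is the boolean False
print(quoted["exposure_mappings"])    # the first key is the string "No"
```

strictyaml, when loading without a schema, keeps every scalar as a string, which avoids this kind of surprise.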
YAML files and data validation
You can also make sure the values read from a YAML file have the correct data types by using pydantic, a Python library for data validation and settings management. We define classes that inherit from pydantic's BaseModel. Within each of these classes, we declare class attributes with a type annotation, so that when we load data into the class, pydantic validates each attribute against its expected type.
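As a rough illustration of that validation step, the sketch below defines a reduced, hypothetical model (TrainingConfig is not part of the course code) and feeds it string values, which is how strictyaml returns scalars when no schema is given. It assumes pydantic is installed and relies only on the default conversion behaviour of BaseModel.

```python
from typing import Dict, List

from pydantic import BaseModel, ValidationError  # assumed installed


class TrainingConfig(BaseModel):  # hypothetical, reduced model for illustration
    test_size: float
    random_state: int
    features: List[str]
    qual_mappings: Dict[str, int]


# Values as strings, the way a schema-less strictyaml load returns them
raw = {
    "test_size": "0.1",
    "random_state": "0",
    "features": ["MSSubClass", "MSZoning"],
    "qual_mappings": {"Po": "1", "Fa": "2"},
}

# pydantic converts each value to the declared type
cfg = TrainingConfig(**raw)
print(cfg.test_size, cfg.random_state, cfg.qual_mappings)

# A value that cannot be converted to the declared type is rejected
try:
    TrainingConfig(**{**raw, "test_size": "ten percent"})
    validation_failed = False
except ValidationError as err:
    validation_failed = True
    print(err)
```

The benefit is that a typo in the config file surfaces when the config is loaded, not halfway through training.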
To incorporate the yml file into the feature engineering pipeline.py script, we create a core.py script located in the config folder.
/regression_model/config/core.py script
from pathlib import Path  # object-oriented alternative to os.path, to define paths and directories
from typing import Dict, List, Optional, Sequence

from pydantic import BaseModel  # used to define classes which represent our config
from strictyaml import YAML, load

import regression_model

# Project Directories: define where config.yml, datasets, etc, are
PACKAGE_ROOT = Path(regression_model.__file__).resolve().parent
ROOT = PACKAGE_ROOT.parent
CONFIG_FILE_PATH = PACKAGE_ROOT / "config.yml"
DATASET_DIR = PACKAGE_ROOT / "datasets"
TRAINED_MODEL_DIR = PACKAGE_ROOT / "trained_models"


class AppConfig(BaseModel):
    """
    Application-level config.
    """
    package_name: str
    training_data_file: str
    test_data_file: str
    pipeline_save_file: str


class ModelConfig(BaseModel):
    """
    All configuration relevant to model training and feature engineering.
    Data type is specified for running data type validation in pydantic
    """
    target: str
    variables_to_rename: Dict
    features: List[str]
    test_size: float
    random_state: int
    alpha: float
    categorical_vars_with_na_frequent: List[str]
    categorical_vars_with_na_missing: List[str]
    numerical_vars_with_na: List[str]
    temporal_vars: List[str]
    ref_var: str
    numericals_log_vars: Sequence[str]
    binarize_vars: Sequence[str]
    qual_vars: List[str]
    exposure_vars: List[str]
    finish_vars: List[str]
    garage_vars: List[str]
    categorical_vars: Sequence[str]
    qual_mappings: Dict[str, int]
    exposure_mappings: Dict[str, int]
    garage_mappings: Dict[str, int]
    finish_mappings: Dict[str, int]


# NOTE: pydantic v2 reserves the attribute name `model_config`,
# so this class as written assumes pydantic v1
class Config(BaseModel):
    """Master config object."""
    app_config: AppConfig
    model_config: ModelConfig


def find_config_file() -> Path:
    """Locate the configuration file."""
    if CONFIG_FILE_PATH.is_file():
        return CONFIG_FILE_PATH
    raise Exception(f"Config not found at {CONFIG_FILE_PATH!r}")


def fetch_config_from_yaml(cfg_path: Optional[Path] = None) -> YAML:
    """Parse YAML containing the package configuration."""
    if not cfg_path:
        cfg_path = find_config_file()

    if cfg_path:
        with open(cfg_path, "r") as conf_file:
            parsed_config = load(conf_file.read())
            return parsed_config
    raise OSError(f"Did not find config file at path: {cfg_path}")


def create_and_validate_config(parsed_config: Optional[YAML] = None) -> Config:
    """Run validation on config values."""
    if parsed_config is None:
        parsed_config = fetch_config_from_yaml()

    # specify the data attribute from the strictyaml YAML type.
    # Both models receive the full dictionary; each keeps only the keys it declares.
    _config = Config(
        app_config=AppConfig(**parsed_config.data),
        model_config=ModelConfig(**parsed_config.data),
    )
    return _config


config = create_and_validate_config()
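The last step of core.py passes the same flat dictionary to both AppConfig and ModelConfig. This works because, by default, a pydantic model ignores keys it does not declare. The sketch below illustrates the idea with reduced, hypothetical stand-ins (AppSettings, MlSettings, and Settings are not part of the course code) and a hand-written dictionary in place of parsed_config.data; it assumes pydantic is installed.

```python
from typing import List

from pydantic import BaseModel  # assumed installed


class AppSettings(BaseModel):  # hypothetical stand-in for AppConfig
    package_name: str
    training_data_file: str


class MlSettings(BaseModel):  # hypothetical stand-in for ModelConfig
    target: str
    features: List[str]


class Settings(BaseModel):  # hypothetical stand-in for Config
    app_settings: AppSettings
    ml_settings: MlSettings


# One flat dictionary, playing the role of parsed_config.data
flat = {
    "package_name": "regression_model",
    "training_data_file": "train.csv",
    "target": "SalePrice",
    "features": ["MSSubClass", "MSZoning"],
}

# Each model picks the keys it declares; by default pydantic ignores the rest
settings = Settings(
    app_settings=AppSettings(**flat),
    ml_settings=MlSettings(**flat),
)
print(settings.app_settings.package_name)
print(settings.ml_settings.features)
```

Keeping the YAML file flat and splitting it in Python means the file stays easy to read, while the code still gets two clearly separated groups of settings.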
Values are then imported in the pipeline.py script, for example:
Loading values from the yml file in the pipeline.py script
from regression_model.config.core import config

config.model_config.categorical_vars_with_na_missing
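To give an idea of how such values might then drive a transformation, here is a small sketch that runs without the regression_model package. The model_config object is a stand-in built with SimpleNamespace, and the rows are made-up sample data; only the variable names and mapping values come from the config.yml shown earlier.

```python
from types import SimpleNamespace

# Stand-in for config.model_config, holding two entries from config.yml
model_config = SimpleNamespace(
    qual_vars=["ExterQual", "KitchenQual"],
    qual_mappings={"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5, "Missing": 0, "NA": 0},
)

# Hypothetical rows of raw data
rows = [
    {"ExterQual": "Gd", "KitchenQual": "TA"},
    {"ExterQual": "Ex", "KitchenQual": "Missing"},
]

# Replace each quality label by its numeric code, driven only by the config
encoded = [
    {var: model_config.qual_mappings[row[var]] for var in model_config.qual_vars}
    for row in rows
]
print(encoded)
```

Because the variable list and the mapping both live in the config, changing which columns are encoded or how they are scored requires no change to the pipeline code.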