YAML basics

4 min read · Feb 8, 2022

Source

https://www.udemy.com/course/deployment-of-machine-learning-models/
You can check the GitHub repo of the Udemy course to access all files and scripts.
https://hitchdev.com/strictyaml/why-not/turing-complete-code/
https://stackabuse.com/reading-and-writing-yaml-to-a-file-in-python/

Config with YAML

It is a good idea to move all global constants out of the feature engineering scripts and into a YAML config file, instead of listing them inside the .py script. The idea is to use the least powerful language that can do the job: a config file only holds data, so it cannot hide logic bugs, and other languages can read it easily. It also gives you a single place where all your constants are stored, which is convenient.

The basics of a YAML file

# assign a value
package_name: regression_model

# create a dictionary
variables_to_rename:
  1stFlrSF: FirstFlrSF
  2ndFlrSF: SecondFlrSF
  3SsnPorch: ThreeSsnPortch

# create a list
features:
  - MSSubClass
  - MSZoning
  - LotFrontage
  - LotShape
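
To see how these three constructs map to Python objects, here is a minimal sketch that parses a shortened copy of the snippet above from an inline string. It assumes PyYAML is installed and uses `yaml.safe_load`; the variable names `YAML_TEXT` and `config` are only for illustration.

```python
import yaml  # PyYAML: pip install pyyaml

# Shortened copy of the YAML snippet, inlined so the sketch is self-contained.
YAML_TEXT = """
package_name: regression_model
variables_to_rename:
  1stFlrSF: FirstFlrSF
  2ndFlrSF: SecondFlrSF
features:
  - MSSubClass
  - MSZoning
"""

config = yaml.safe_load(YAML_TEXT)

# A scalar becomes a str, a mapping becomes a dict, a sequence becomes a list.
print(type(config["package_name"]).__name__)
print(type(config["variables_to_rename"]).__name__)
print(type(config["features"]).__name__)

# Keys that start with a digit are still plain strings.
print(config["variables_to_rename"]["1stFlrSF"])
```

Keys such as 1stFlrSF are fine in YAML because they are just strings, which is why the renaming lives in the config rather than in Python, where a name cannot start with a digit.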

Reading and saving data in a YAML file

# simple test to read yaml files
# https://stackabuse.com/reading-and-writing-yaml-to-a-file-in-python/
# pip install pyyaml
import yaml

# to read yaml data
with open(r'C:\Users\XXX\TEST_yml\config.yml') as file:
    # full_load uses the FullLoader, which converts the YAML
    # content to Python objects (here, a dictionary)
    documents = yaml.full_load(file)

    for item, doc in documents.items():
        print(item, ":", doc)

print(documents['features'])

# to save data in yaml format
# note that if you already have data in the file it will be erased and replaced by the below
dict_file = [{'sports': ['soccer', 'football', 'basketball', 'cricket', 'hockey', 'table tennis']},
             {'countries': ['Pakistan', 'USA', 'India', 'China', 'Germany', 'France', 'Spain']}]

with open(r'C:\Users\XXX\TEST_yml\config.yml', 'w') as file:
    # yaml.dump writes to the file and returns None when given a stream
    yaml.dump(dict_file, file)
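
If you want to try the write and read steps without touching a real path, the following sketch does a round trip through a temporary directory. It assumes PyYAML and uses `yaml.safe_dump` and `yaml.safe_load` instead of `dump` and `full_load`; the file name and the shortened data are made up for the example.

```python
import tempfile
from pathlib import Path

import yaml  # PyYAML: pip install pyyaml

# Shortened, made-up data for the example.
data = {
    "sports": ["soccer", "football"],
    "countries": ["Pakistan", "USA"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.yml"  # hypothetical file name

    # Mode "w" truncates the file, so any earlier content is replaced.
    with open(config_path, "w") as file:
        yaml.safe_dump(data, file)

    # Read the file back into Python objects.
    with open(config_path) as file:
        loaded = yaml.safe_load(file)

print(loaded == data)
```

The safe variants only handle plain data types (strings, numbers, lists, dictionaries), which is usually all a config file needs.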

What model-related information to store in a YAML file

The Udemy “deployment of machine learning models” course gives a good example. Its Config.yml file stores information related to:

Package name
File names (train.csv, test.csv)
Any feature lists needed for a feature engineering technique (numerical_vars_with_na, temporal_vars)
Mappings for transforming features
Train/test split (test_size: 0.1)
Random seed value
ML constants (alpha, etc.)

Below is the course's Config.yml file, located in the regression_model folder. It covers many different types of constants.

# Package Overview
package_name: regression_model

# Data Files
training_data_file: train.csv
test_data_file: test.csv

# Variables
# The variable we are attempting to predict (sale price)
target: SalePrice

pipeline_name: regression_model
pipeline_save_file: regression_model_output_v

# Will cause syntax errors since they begin with numbers
variables_to_rename:
  1stFlrSF: FirstFlrSF
  2ndFlrSF: SecondFlrSF
  3SsnPorch: ThreeSsnPortch

features:
  - MSSubClass
  - MSZoning
  # - […]

# set train/test split
test_size: 0.1

# to set the random seed
random_state: 0

alpha: 0.001

# categorical variables with NA in train set
categorical_vars_with_na_frequent:
  - BsmtQual
  - BsmtExposure
  - BsmtFinType1
  - GarageFinish

categorical_vars_with_na_missing:
  - FireplaceQu

numerical_vars_with_na:
  - LotFrontage

temporal_vars:
  - YearRemodAdd

ref_var: YrSold

# variables to log transform
numericals_log_vars:
  - LotFrontage
  - FirstFlrSF
  - GrLivArea

binarize_vars:
  - ScreenPorch

# variables to map
qual_vars:
  - ExterQual
  - BsmtQual
  - HeatingQC
  - KitchenQual
  - FireplaceQu

exposure_vars:
  - BsmtExposure

finish_vars:
  - BsmtFinType1

garage_vars:
  - GarageFinish

categorical_vars:
  - MSSubClass
  - MSZoning
  # - […]

# variable mappings
qual_mappings:
  Po: 1
  Fa: 2
  TA: 3
  Gd: 4
  Ex: 5
  Missing: 0
  NA: 0

exposure_mappings:
  No: 1
  Mn: 2
  Av: 3
  Gd: 4

finish_mappings:
  Missing: 0
  NA: 0
  Unf: 1
  LwQ: 2
  Rec: 3
  BLQ: 4
  ALQ: 5
  GLQ: 6

garage_mappings:
  Missing: 0
  NA: 0
  Unf: 1
  RFn: 2
  Fin: 3
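
One thing to watch with mappings like exposure_mappings if you read this file with PyYAML rather than strictyaml: PyYAML follows YAML 1.1, where a bare No is resolved to a boolean. The sketch below shows the effect on a two-key excerpt and one way around it (quoting the key). The variable names are only for illustration. As far as I understand, the course's core.py script reads the file with strictyaml, which keeps values as strings unless a schema says otherwise, so it should not be affected.

```python
import yaml  # PyYAML: pip install pyyaml

# Bare "No" is resolved to the boolean False by PyYAML (YAML 1.1 rules).
unquoted = yaml.safe_load("""
exposure_mappings:
  No: 1
  Mn: 2
""")

# Quoting the key keeps it as the string "No".
quoted = yaml.safe_load("""
exposure_mappings:
  "No": 1
  Mn: 2
""")

print(list(unquoted["exposure_mappings"]))
print(list(quoted["exposure_mappings"]))
```

With the unquoted version, a lookup such as mapping["No"] would raise a KeyError, because the key stored in the dictionary is False rather than the string.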

YAML files and data validation

You can also check that the values loaded from a YAML file have the expected data types by using pydantic. This Python library is used for data validation and settings management. We define classes that inherit from pydantic's BaseModel. Within each of these classes, we declare class attributes with a type annotation, so that when we load data into the class, pydantic validates each attribute against its expected type.
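
As a minimal sketch of that idea, the example below validates a small dictionary against a pydantic model. The class name `SplitConfig` and the variable names are hypothetical; only the field names come from the config file. It assumes pydantic is installed and that the values arrive as strings, as they might from a parser that does not convert types.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class SplitConfig(BaseModel):  # hypothetical class name, for illustration only
    target: str
    test_size: float
    random_state: int
    features: List[str]


# Values as they might arrive from a YAML parser that returns strings.
raw = {
    "target": "SalePrice",
    "test_size": "0.1",
    "random_state": "0",
    "features": ["MSSubClass", "MSZoning"],
}

# Numeric strings are converted to the annotated types.
cfg = SplitConfig(**raw)
print(cfg.test_size, cfg.random_state)

# A value that cannot be converted raises a ValidationError.
try:
    SplitConfig(**{**raw, "test_size": "ten percent"})
except ValidationError as err:
    validation_failed = True
    print(err)
else:
    validation_failed = False
```

The benefit is that a typo in the config file fails loudly when the config is loaded, rather than later in the middle of training.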

To use the yml file in the feature engineering pipeline.py script, we create a core.py script located in the config folder.

/regression_model/config/core.py script

from pathlib import Path  # similar to os, to define location paths and directories
from typing import Dict, List, Sequence

from pydantic import BaseModel  # used to define classes which represent our config
from strictyaml import YAML, load

import regression_model

# Project Directories: define where config.yml, datasets, etc. are
PACKAGE_ROOT = Path(regression_model.__file__).resolve().parent
ROOT = PACKAGE_ROOT.parent
CONFIG_FILE_PATH = PACKAGE_ROOT / "config.yml"
DATASET_DIR = PACKAGE_ROOT / "datasets"
TRAINED_MODEL_DIR = PACKAGE_ROOT / "trained_models"


class AppConfig(BaseModel):
    """
    Application-level config.
    """
    package_name: str
    training_data_file: str
    test_data_file: str
    pipeline_save_file: str


class ModelConfig(BaseModel):
    """
    All configuration relevant to model training and feature engineering.
    Data type is specified for running data type validation in pydantic
    """
    target: str
    variables_to_rename: Dict
    features: List[str]
    test_size: float
    random_state: int
    alpha: float
    categorical_vars_with_na_frequent: List[str]
    categorical_vars_with_na_missing: List[str]
    numerical_vars_with_na: List[str]
    temporal_vars: List[str]
    ref_var: str
    numericals_log_vars: Sequence[str]
    binarize_vars: Sequence[str]
    qual_vars: List[str]
    exposure_vars: List[str]
    finish_vars: List[str]
    garage_vars: List[str]
    categorical_vars: Sequence[str]
    qual_mappings: Dict[str, int]
    exposure_mappings: Dict[str, int]
    garage_mappings: Dict[str, int]
    finish_mappings: Dict[str, int]


class Config(BaseModel):
    """Master config object."""
    app_config: AppConfig
    model_config: ModelConfig


def find_config_file() -> Path:
    """Locate the configuration file."""
    if CONFIG_FILE_PATH.is_file():
        return CONFIG_FILE_PATH
    raise Exception(f"Config not found at {CONFIG_FILE_PATH!r}")


def fetch_config_from_yaml(cfg_path: Path = None) -> YAML:
    """Parse YAML containing the package configuration."""
    if not cfg_path:
        cfg_path = find_config_file()

    if cfg_path:
        with open(cfg_path, "r") as conf_file:
            parsed_config = load(conf_file.read())
            return parsed_config
    raise OSError(f"Did not find config file at path: {cfg_path}")


def create_and_validate_config(parsed_config: YAML = None) -> Config:
    """Run validation on config values."""
    if parsed_config is None:
        parsed_config = fetch_config_from_yaml()

    # specify the data attribute from the strictyaml YAML type.
    _config = Config(
        app_config=AppConfig(**parsed_config.data),
        model_config=ModelConfig(**parsed_config.data),
    )
    return _config


config = create_and_validate_config()
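
The pattern in create_and_validate_config, where the same flat dictionary is passed to two different models, works because pydantic ignores unknown keys by default. Here is a self-contained sketch of that pattern with a plain dictionary standing in for the parsed YAML. The class and field names (`AppSettings`, `TrainingSettings`, `Settings`, `app`, `training`) are hypothetical and not the course's. I avoided a field called model_config here because, to my understanding, pydantic v2 reserves that name for model settings.

```python
from typing import Dict, List

from pydantic import BaseModel


class AppSettings(BaseModel):  # hypothetical stand-in for an app-level config
    package_name: str
    training_data_file: str


class TrainingSettings(BaseModel):  # hypothetical stand-in for a model-level config
    target: str
    test_size: float
    features: List[str]
    qual_mappings: Dict[str, int]


class Settings(BaseModel):
    app: AppSettings
    training: TrainingSettings


# Flat dictionary with string values, standing in for the parsed YAML data.
flat = {
    "package_name": "regression_model",
    "training_data_file": "train.csv",
    "target": "SalePrice",
    "test_size": "0.1",
    "features": ["MSSubClass", "MSZoning"],
    "qual_mappings": {"Po": "1", "Fa": "2"},
}

# Each model picks the keys it declares and ignores the rest.
settings = Settings(app=AppSettings(**flat), training=TrainingSettings(**flat))

print(settings.app.package_name)
print(settings.training.qual_mappings)
```

Each model keeps only its own fields, so the config file can stay flat while the Python side groups values by purpose.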

The values are then imported into the pipeline.py script, for example:

Loading values from the yml file in the pipeline.py script

from regression_model.config.core import config

config.model_config.categorical_vars_with_na_missing


Written by Marc Deveaux
