YAML basics

Marc Deveaux
Feb 8, 2022


Config with YAML

It is a good idea to move all global constants from the feature engineering scripts into a YAML config file (instead of listing them inside the .py script). The idea is to use the least powerful language that can do the job: a plain config format reduces the risk of bugs and makes the information easy to access from other languages. It also gives you a single place where all your constants live, which is pretty convenient.

The basics of a YAML file

# assign a value
package_name: regression_model

# create a dictionary
variables_to_rename:
  1stFlrSF: FirstFlrSF
  2ndFlrSF: SecondFlrSF
  3SsnPorch: ThreeSsnPortch

# create a list
features:
  - MSSubClass
  - MSZoning
  - LotFrontage
  - LotShape

Reading and saving data in a YAML file

# simple test to read yaml files
# https://stackabuse.com/reading-and-writing-yaml-to-a-file-in-python/
# pip install pyyaml
import yaml

# to read yaml data
with open(r'C:\Users\XXX\TEST_yml\config.yml') as file:
    # The FullLoader parameter handles the conversion from YAML
    # scalar values to the Python dictionary format
    documents = yaml.full_load(file)

for item, doc in documents.items():
    print(item, ":", doc)

print(documents['features'])

# to save data in yaml format
# note that if the file already contains data, it will be erased and replaced by the content below
dict_file = [{'sports': ['soccer', 'football', 'basketball', 'cricket', 'hockey', 'table tennis']},
             {'countries': ['Pakistan', 'USA', 'India', 'China', 'Germany', 'France', 'Spain']}]

with open(r'C:\Users\XXX\TEST_yml\config.yml', 'w') as file:
    documents = yaml.dump(dict_file, file)
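
If you want to keep what is already in the file rather than overwrite it, one option is to read the file first, update the resulting Python object, and dump everything back. This is a minimal sketch, assuming the file holds a single mapping like the config.yml shown later (the keys updated here are only illustrative):

import yaml

config_path = r'C:\Users\XXX\TEST_yml\config.yml'

# read the current contents (an empty file yields None, hence the fallback)
with open(config_path) as file:
    current = yaml.full_load(file) or {}

# update or add keys without touching the rest
current['random_state'] = 0
current.setdefault('features', []).append('LotArea')

# write everything back
with open(config_path, 'w') as file:
    yaml.dump(current, file)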

What model-related information to store in a YAML file

As an example, the Udemy “Deployment of Machine Learning Models” course stores in a config.yml file information related to:

  • Package name
  • File names (train.csv, test.csv)
  • Feature lists needed for the feature engineering techniques (numerical_vars_with_na, temporal_vars)
  • Mappings for transforming features
  • Train/test split settings (test_size: 0.1)
  • Random seed value
  • Model constants (alpha, etc.)

Below is the course's config.yml file, located in the regression_model folder. We can see it covers many different kinds of constants.

# Package Overview
package_name: regression_model

# Data Files
training_data_file: train.csv
test_data_file: test.csv

# Variables
# The variable we are attempting to predict (sale price)
target: SalePrice

pipeline_name: regression_model
pipeline_save_file: regression_model_output_v

# variables to rename because their names begin with numbers, which would cause syntax errors
variables_to_rename:
  1stFlrSF: FirstFlrSF
  2ndFlrSF: SecondFlrSF
  3SsnPorch: ThreeSsnPortch

features:
  - MSSubClass
  - MSZoning
  # - […]

# set train/test split
test_size: 0.1

# to set the random seed
random_state: 0

alpha: 0.001

# categorical variables with NA in train set
categorical_vars_with_na_frequent:
  - BsmtQual
  - BsmtExposure
  - BsmtFinType1
  - GarageFinish

categorical_vars_with_na_missing:
  - FireplaceQu

numerical_vars_with_na:
  - LotFrontage

temporal_vars:
  - YearRemodAdd

ref_var: YrSold

# variables to log transform
numericals_log_vars:
  - LotFrontage
  - FirstFlrSF
  - GrLivArea

binarize_vars:
  - ScreenPorch

# variables to map
qual_vars:
  - ExterQual
  - BsmtQual
  - HeatingQC
  - KitchenQual
  - FireplaceQu

exposure_vars:
  - BsmtExposure

finish_vars:
  - BsmtFinType1

garage_vars:
  - GarageFinish

categorical_vars:
  - MSSubClass
  - MSZoning
  # - […]

# variable mappings
qual_mappings:
  Po: 1
  Fa: 2
  TA: 3
  Gd: 4
  Ex: 5
  Missing: 0
  NA: 0

exposure_mappings:
  No: 1
  Mn: 2
  Av: 3
  Gd: 4

finish_mappings:
  Missing: 0
  NA: 0
  Unf: 1
  LwQ: 2
  Rec: 3
  BLQ: 4
  ALQ: 5
  GLQ: 6

garage_mappings:
  Missing: 0
  NA: 0
  Unf: 1
  RFn: 2
  Fin: 3

YAML files and data validation

It is also possible to make sure the values coming from a YAML file have the correct data types by using Pydantic, a Python library for data validation and settings management. We define classes that inherit from the pydantic BaseModel; within each of these classes we declare class attributes with a type annotation, so that when the data is loaded, validation runs against the expected type of each attribute.
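
As a minimal sketch of this idea (the class and field names below are illustrative, not taken from the course code), pydantic coerces or rejects values based on the declared types:

from typing import List
from pydantic import BaseModel, ValidationError

class DemoConfig(BaseModel):
    package_name: str
    test_size: float
    features: List[str]

# values parsed from YAML are plain Python objects; pydantic checks them
cfg = DemoConfig(package_name="regression_model", test_size=0.1, features=["MSSubClass"])
print(cfg.test_size)  # 0.1

try:
    DemoConfig(package_name="regression_model", test_size="not a number", features=[])
except ValidationError as err:
    print(err)  # reports that test_size is not a valid float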

To incorporate the .yml file into the feature engineering pipeline.py script, we create a core.py script located in the config folder.

/regression_model/config/core.py script

from pathlib import Path  # similar to os, used to define location paths and directories
from typing import Dict, List, Sequence

from pydantic import BaseModel  # used to define classes which represent our config
from strictyaml import YAML, load

import regression_model

# Project Directories: define where config.yml, datasets, etc. are
PACKAGE_ROOT = Path(regression_model.__file__).resolve().parent
ROOT = PACKAGE_ROOT.parent
CONFIG_FILE_PATH = PACKAGE_ROOT / "config.yml"
DATASET_DIR = PACKAGE_ROOT / "datasets"
TRAINED_MODEL_DIR = PACKAGE_ROOT / "trained_models"


class AppConfig(BaseModel):
    """
    Application-level config.
    """

    package_name: str
    training_data_file: str
    test_data_file: str
    pipeline_save_file: str


class ModelConfig(BaseModel):
    """
    All configuration relevant to model training and feature engineering.
    Data types are specified so pydantic can run data type validation.
    """

    target: str
    variables_to_rename: Dict
    features: List[str]
    test_size: float
    random_state: int
    alpha: float
    categorical_vars_with_na_frequent: List[str]
    categorical_vars_with_na_missing: List[str]
    numerical_vars_with_na: List[str]
    temporal_vars: List[str]
    ref_var: str
    numericals_log_vars: Sequence[str]
    binarize_vars: Sequence[str]
    qual_vars: List[str]
    exposure_vars: List[str]
    finish_vars: List[str]
    garage_vars: List[str]
    categorical_vars: Sequence[str]
    qual_mappings: Dict[str, int]
    exposure_mappings: Dict[str, int]
    garage_mappings: Dict[str, int]
    finish_mappings: Dict[str, int]


class Config(BaseModel):
    """Master config object."""

    app_config: AppConfig
    model_config: ModelConfig


def find_config_file() -> Path:
    """Locate the configuration file."""
    if CONFIG_FILE_PATH.is_file():
        return CONFIG_FILE_PATH
    raise Exception(f"Config not found at {CONFIG_FILE_PATH!r}")


def fetch_config_from_yaml(cfg_path: Path = None) -> YAML:
    """Parse YAML containing the package configuration."""
    if not cfg_path:
        cfg_path = find_config_file()
    if cfg_path:
        with open(cfg_path, "r") as conf_file:
            parsed_config = load(conf_file.read())
            return parsed_config
    raise OSError(f"Did not find config file at path: {cfg_path}")


def create_and_validate_config(parsed_config: YAML = None) -> Config:
    """Run validation on config values."""
    if parsed_config is None:
        parsed_config = fetch_config_from_yaml()

    # specify the data attribute from the strictyaml YAML type
    _config = Config(
        app_config=AppConfig(**parsed_config.data),
        model_config=ModelConfig(**parsed_config.data),
    )
    return _config


config = create_and_validate_config()

Values are then imported into the pipeline.py script, for example:

Loading values from the .yml file in the pipeline.py script

from regression_model.config.core import config

config.model_config.categorical_vars_with_na_missing
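
To give an idea of how these values might then be consumed, here is a sketch (not the course's actual pipeline.py; it assumes a pandas DataFrame named data has already been loaded):

from sklearn.model_selection import train_test_split

from regression_model.config.core import config

# `data` is assumed to be a pandas DataFrame loaded elsewhere
# split using the constants defined once in config.yml
X_train, X_test, y_train, y_test = train_test_split(
    data[config.model_config.features],   # predictors listed in the config
    data[config.model_config.target],     # SalePrice
    test_size=config.model_config.test_size,
    random_state=config.model_config.random_state,
)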
