Pydantic library introduction

Sources

https://pydantic-docs.helpmanual.io/

https://github.com/mikeckennedy/python-shorts/blob/main/shorts/02-parse-validate-with-pydantic/code/receive_data.py

https://github.com/trainindata/deploying-machine-learning-models/blob/master/section-05-production-model-package/regression_model/processing/validation.py

https://lyz-code.github.io/blue-book/coding/python/pydantic/

https://www.youtube.com/watch?v=Nlhp4EmE55I

https://www.inwt-statistics.com/read-blog/pandas-dataframe-validation-with-pydantic.html

What problem does Pydantic solve?

  • Pydantic is used for data validation. For example, when you import a dataset for modeling, you typically want to make sure that all the data types are as expected, to avoid errors that could surface later in the feature engineering pipeline. Pydantic validates the data types and, in case of an error, gives you useful feedback on where the error came from. In some cases it can also parse values automatically: for example, [1, "4", 6] is converted to [1, 4, 6] if the field was declared as a list of integers.
  • It can be used in place of the basic dataclass (from the dataclasses module in the Python standard library) and offers more functionality.
  • It also makes it easy to import data in JSON format and pass it through parsing and validation.
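
As a rough illustration of the last two points, the sketch below decodes a JSON string with the standard json module and hands the result to a model. The Measurement class and its field names are hypothetical, chosen only for this example. During validation the string "4" is converted to the integer 4.

```python
import json
from typing import List

from pydantic import BaseModel


class Measurement(BaseModel):  # hypothetical model, not taken from the sources
    sensor: str
    values: List[int]


raw = '{"sensor": "s1", "values": [1, "4", 6]}'

# Decode the JSON text into a dict, then let Pydantic validate and parse it
measurement = Measurement(**json.loads(raw))
print(measurement.values)
```

Going through json.loads keeps the sketch independent of the Pydantic version, since the dedicated JSON helpers are named differently in v1 and v2.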

The basic usage

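from typing import Optional

from pydantic import BaseModel

# Raw input, e.g. decoded from JSON. Note that "age" and one of the rides are strings.
data = {
    "name": "Michael Kennedy",
    "age": "28",
    "location": {
        "city": "Portland",
        "state": "Oregon"
    },
    "bike": "KTM Duke 690",
    "rides": [7, 103, 22, "70", 1000]
}


class Location(BaseModel):
    city: str
    state: str
    # The explicit default keeps the field optional in both Pydantic v1 and v2
    country: Optional[str] = None


class User(BaseModel):
    name: str
    age: int
    location: Location
    bike: str
    rides: list[int] = []  # list[int] needs Python 3.9+


# Unpack the dict into keyword arguments; the values are validated and parsed
user = User(**data)
print(f"Found a user: {user}")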

From the code above:

  • You create a class User that inherits from BaseModel and declare the data type of each feature. Note that the location feature is defined in its own class, Location.
  • Optional[str] indicates that the value may be missing, but if it is present it has to be a string. You have to import Optional from the typing library. (In Pydantic v2, also give the field a default such as None; otherwise it is treated as required.)
  • If you don't know what the data type will be, you can annotate the field with Any (also from typing) and give it a default, e.g. location: Any = None
  • You can set a default value, e.g. country: str = "USA"
  • rides: list[int] = [] means a list of integers; it is fine if there are no rides and the list is empty
  • user = User(**data) checks the data types and parses the data. In this example, the ride "70" is automatically converted to 70. If there is an error, you get a useful message that tells you where it comes from
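
To see what that error feedback looks like, here is a small sketch with deliberately broken input. The bad values are made up for this example, and the models are trimmed-down versions of the ones above. errors() returns one entry per problem, and its "loc" key gives the path to the offending value.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Location(BaseModel):
    city: str
    state: str
    country: Optional[str] = None


class User(BaseModel):
    name: str
    age: int
    location: Location
    rides: List[int] = []


# Deliberately invalid input (made up for illustration)
bad_data = {
    "name": "Michael Kennedy",
    "age": "twenty-eight",             # not an integer
    "location": {"city": "Portland"},  # "state" is missing
    "rides": [7, 103, "seventy"],      # third ride is not an integer
}

try:
    User(**bad_data)
    error_paths = []
except ValidationError as err:
    # Each error carries the path ("loc") to the value that failed
    error_paths = [e["loc"] for e in err.errors()]
    for path in error_paths:
        print(path)
```

Nested fields and list items are reported with their full path, for example the field name followed by the position in the list.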

Example with additional data validation rules

We can add rules for any of the features, for example that a number must be positive or that a string must not be longer than X characters.

import string
from typing import Optional

import pydantic


class User(pydantic.BaseModel):
    username: str
    password: str
    age: int
    score: float
    email: Optional[str] = None
    phone_number: Optional[str] = None

    # new rules, first for feature username and then for password
    # (validator is the Pydantic v1 decorator; v2 deprecates it in favor of field_validator)
    @pydantic.validator("username")
    @classmethod
    def username_valid(cls, value):  # cls is the model class
        if any(p in value for p in string.punctuation):
            raise ValueError("No punctuation!")
        return value

    @pydantic.validator("password")
    @classmethod
    def password_valid(cls, value):
        if len(value) < 8:
            raise ValueError("Pwd must be at least 8 char long")
        # require at least one character from each of these groups
        required = (string.punctuation, string.digits,
                    string.ascii_lowercase, string.ascii_uppercase)
        if all(any(c in value for c in chars) for chars in required):
            return value
        raise ValueError("password needs at least one punctuation symbol, digit, etc.")

    # new rule for 2 features at once
    @pydantic.validator("age", "score")
    @classmethod
    def number_valid(cls, value):
        if value >= 0:
            return value
        raise ValueError("Numbers must not be negative")


# "user_1" contains punctuation and "12345" is too short, so this input is rejected
try:
    user1 = User(username="user_1", password="12345", age=20, score=0, email="my@mail.com")
    print(user1)
    print(user1.age)
except pydantic.ValidationError as err:
    print(err)
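
For simple rules such as "must not be negative" or "at most X characters", a custom validator function is not always needed. As a sketch, such constraints can be declared with Field instead. The Player model and its limits are hypothetical values picked for this example.

```python
from pydantic import BaseModel, Field, ValidationError


class Player(BaseModel):  # hypothetical model for illustration
    username: str = Field(..., min_length=3, max_length=12)
    age: int = Field(..., ge=0)      # ge=0: greater than or equal to 0
    score: float = Field(..., ge=0)


ok = Player(username="user1", age=20, score=0)
print(ok)

try:
    # The username is longer than 12 characters and the age is negative
    Player(username="a_very_long_username", age=-1, score=3.5)
    failed_fields = []
except ValidationError as err:
    failed_fields = [e["loc"][0] for e in err.errors()]
    print(failed_fields)
```

Custom validator functions remain the better fit for rules that need real logic, such as the password check above.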

Combining Pydantic and Pandas

To validate a DataFrame, we first convert it to a list of dictionaries (one per row) with df.to_dict(orient="records"). Each of these dictionaries can then be passed to a Pydantic class.

import pandas as pd
from pydantic import BaseModel, Field, ValidationError
from typing import List


class DictValidator(BaseModel):
    id: int = Field(..., ge=1)
    name: str = Field(..., max_length=20)
    height: float = Field(..., ge=0, le=250, description="Height in cm.")


person_data = pd.DataFrame([{"id": 1, "name": "Sebastian", "height": 178},
                            {"id": 2, "name": "Max", "height": 218},
                            {"id": 3, "name": "Mustermann", "height": 151}])
print(person_data.to_dict(orient="records"))
# now we can pass each of these dicts to the validator


class PdVal(BaseModel):
    df_dict: List[DictValidator]


# you validate the data like this
PdVal(df_dict=person_data.to_dict(orient="records"))

# test generating an error because height for id 3 is > 250:
wrong_person_data = pd.DataFrame([{"id": 1, "name": "Sebastian", "height": 178},
                                  {"id": 2, "name": "Max", "height": 218},
                                  {"id": 3, "name": "Mustermann", "height": 251}])
try:
    PdVal(df_dict=wrong_person_data.to_dict(orient="records"))
except ValidationError as e:
    print(e)
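
When a larger DataFrame fails validation, it helps to know which rows are at fault. A possible sketch: because the records are validated as a list, the second element of each error's "loc" is the position of the record, which matches the row position in the DataFrame. The helper name invalid_rows is an assumption of this example, and the two models are repeated so that the block runs on its own.

```python
from typing import List

import pandas as pd
from pydantic import BaseModel, Field, ValidationError


class DictValidator(BaseModel):
    id: int = Field(..., ge=1)
    name: str = Field(..., max_length=20)
    height: float = Field(..., ge=0, le=250, description="Height in cm.")


class PdVal(BaseModel):
    df_dict: List[DictValidator]


def invalid_rows(df: pd.DataFrame) -> List[tuple]:
    """Return (row position, column name, message) for every failed check."""
    try:
        PdVal(df_dict=df.to_dict(orient="records"))
    except ValidationError as err:
        # loc looks like ("df_dict", <row position>, <column name>)
        return [(e["loc"][1], e["loc"][2], e["msg"]) for e in err.errors()]
    return []


wrong_person_data = pd.DataFrame([
    {"id": 1, "name": "Sebastian", "height": 178},
    {"id": 2, "name": "Max", "height": 218},
    {"id": 3, "name": "Mustermann", "height": 251},
])

problems = invalid_rows(wrong_person_data)
for row, column, message in problems:
    print(row, column, message)
    print(wrong_person_data.iloc[row])
```

The row position can be passed to iloc to inspect or drop the offending rows before modeling.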

Example of a data type validation check in a modeling project

Template for catching a ValidationError from Pydantic:

from pydantic import ValidationError

try:
    XXX  # placeholder: the code that builds the Pydantic model
except ValidationError as err:
    print(err)

See the following part of the code:

  • You define a class, HouseDataInputSchema, with the data types of all the features
  • HouseDataInputSchema is wrapped as List[HouseDataInputSchema] in the class MultipleHouseDataInputs
  • MultipleHouseDataInputs is called in the validate_inputs function to validate the input after it has been converted to a list of dictionaries: MultipleHouseDataInputs(inputs=validated_data.replace({np.nan: None}).to_dict(orient="records"))
  • replace({np.nan: None}) is applied first because Pydantic does not treat np.nan as a missing value; None is what the Optional fields accept

from typing import List, Optional, Tuple

import numpy as np
import pandas as pd
from pydantic import BaseModel, ValidationError

from regression_model.config.core import config


def drop_na_inputs(*, input_data: pd.DataFrame) -> pd.DataFrame:
    """Check model inputs for na values and filter."""
    validated_data = input_data.copy()
    new_vars_with_na = [
        var
        for var in config.model_config.features
        if var
        not in config.model_config.categorical_vars_with_na_frequent
        + config.model_config.categorical_vars_with_na_missing
        + config.model_config.numerical_vars_with_na
        and validated_data[var].isnull().sum() > 0
    ]
    validated_data.dropna(subset=new_vars_with_na, inplace=True)

    return validated_data


def validate_inputs(*, input_data: pd.DataFrame) -> Tuple[pd.DataFrame, Optional[dict]]:
    """Check model inputs for unprocessable values."""

    # convert syntax error field names (beginning with numbers)
    input_data.rename(columns=config.model_config.variables_to_rename, inplace=True)
    input_data["MSSubClass"] = input_data["MSSubClass"].astype("O")
    relevant_data = input_data[config.model_config.features].copy()
    validated_data = drop_na_inputs(input_data=relevant_data)
    errors = None

    try:
        # replace numpy nans so that pydantic can validate
        MultipleHouseDataInputs(
            inputs=validated_data.replace({np.nan: None}).to_dict(orient="records")
        )
    except ValidationError as error:
        errors = error.json()

    return validated_data, errors


class HouseDataInputSchema(BaseModel):
    Alley: Optional[str]
    BedroomAbvGr: Optional[int]
    BldgType: Optional[str]
    BsmtCond: Optional[str]
    BsmtExposure: Optional[str]
    BsmtFinSF1: Optional[float]
    BsmtFinSF2: Optional[float]
    BsmtFinType1: Optional[str]
    BsmtFinType2: Optional[str]
    BsmtFullBath: Optional[float]
    BsmtHalfBath: Optional[float]
    BsmtQual: Optional[str]
    BsmtUnfSF: Optional[float]
    CentralAir: Optional[str]
    Condition1: Optional[str]
    Condition2: Optional[str]
    Electrical: Optional[str]
    EnclosedPorch: Optional[int]
    ExterCond: Optional[str]
    ExterQual: Optional[str]
    Exterior1st: Optional[str]
    Exterior2nd: Optional[str]
    Fence: Optional[str]
    FireplaceQu: Optional[str]
    Fireplaces: Optional[int]
    Foundation: Optional[str]
    FullBath: Optional[int]
    Functional: Optional[str]
    GarageArea: Optional[float]
    GarageCars: Optional[float]
    GarageCond: Optional[str]
    GarageFinish: Optional[str]
    GarageQual: Optional[str]
    GarageType: Optional[str]
    GarageYrBlt: Optional[float]
    GrLivArea: Optional[int]
    HalfBath: Optional[int]
    Heating: Optional[str]
    HeatingQC: Optional[str]
    HouseStyle: Optional[str]
    Id: Optional[int]
    KitchenAbvGr: Optional[int]
    KitchenQual: Optional[str]
    LandContour: Optional[str]
    LandSlope: Optional[str]
    LotArea: Optional[int]
    LotConfig: Optional[str]
    LotFrontage: Optional[float]
    LotShape: Optional[str]
    LowQualFinSF: Optional[int]
    MSSubClass: Optional[int]
    MSZoning: Optional[str]
    MasVnrArea: Optional[float]
    MasVnrType: Optional[str]
    MiscFeature: Optional[str]
    MiscVal: Optional[int]
    MoSold: Optional[int]
    Neighborhood: Optional[str]
    OpenPorchSF: Optional[int]
    OverallCond: Optional[int]
    OverallQual: Optional[int]
    PavedDrive: Optional[str]
    PoolArea: Optional[int]
    PoolQC: Optional[str]
    RoofMatl: Optional[str]
    RoofStyle: Optional[str]
    SaleCondition: Optional[str]
    SaleType: Optional[str]
    ScreenPorch: Optional[int]
    Street: Optional[str]
    TotRmsAbvGrd: Optional[int]
    TotalBsmtSF: Optional[float]
    Utilities: Optional[str]
    WoodDeckSF: Optional[int]
    YearBuilt: Optional[int]
    YearRemodAdd: Optional[int]
    YrSold: Optional[int]
    FirstFlrSF: Optional[int]  # renamed
    SecondFlrSF: Optional[int]  # renamed
    ThreeSsnPortch: Optional[int]  # renamed


class MultipleHouseDataInputs(BaseModel):
    inputs: List[HouseDataInputSchema]
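
The effect of the np.nan replacement is easier to see in isolation. The sketch below uses a hypothetical two-column stand-in for the full schema (HouseRow, HouseRows and the helper validation_errors are names invented for this example) and validates the same small DataFrame with and without the replacement.

```python
from typing import List, Optional

import numpy as np
import pandas as pd
from pydantic import BaseModel, ValidationError


class HouseRow(BaseModel):  # hypothetical stand-in for HouseDataInputSchema
    Fireplaces: Optional[int] = None
    LotArea: Optional[int] = None


class HouseRows(BaseModel):
    inputs: List[HouseRow]


# The missing value turns the Fireplaces column into floats with a NaN
houses = pd.DataFrame({"Fireplaces": [1, np.nan], "LotArea": [8450, 9600]})


def validation_errors(records: list) -> Optional[list]:
    """Return the list of Pydantic errors, or None if the records are valid."""
    try:
        HouseRows(inputs=records)
    except ValidationError as error:
        return error.errors()
    return None


# NaN is a float that cannot be read as an integer, so this is rejected
errors_with_nan = validation_errors(houses.to_dict(orient="records"))

# After the replacement the missing value is None, which Optional[int] accepts
errors_with_none = validation_errors(
    houses.replace({np.nan: None}).to_dict(orient="records")
)
print(errors_with_nan is not None, errors_with_none)
```

Only the integer case is shown here. For float fields NaN is itself a valid float, so the replacement matters most for integer and string columns.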
