one_cold

OneCold(feature_categories, *, check_for_missing_categories=False)

Bases: Transform

Transforms integer features into a set of binary one-cold features.

Each category value of a feature becomes its own column, which is 0 where the feature equals that category and 1 otherwise. The original integer features are dropped and are not part of the resulting data frame.

As an example, consider the following data:

feature
0
1
2
1
5

When the one-cold transformation is applied, the result is as follows:

feature_0  feature_1  feature_2  feature_5
0          1          1          1
1          0          1          1
1          1          0          1
1          0          1          1
1          1          1          0
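
The encoding rule behind this table can be illustrated with a minimal plain-Python sketch. It does not use the library's Polars-based implementation; the helper name `one_cold_rows` is hypothetical and exists only for illustration. A column is 0 where the value equals its category and 1 otherwise.

```python
def one_cold_rows(values: list[int], categories: list[int]) -> list[list[int]]:
    """Return one row per value; a column is 0 only where the value matches."""
    return [
        [0 if value == category else 1 for category in categories]
        for value in values
    ]


# Same input and categories as in the example above.
rows = one_cold_rows([0, 1, 2, 1, 5], categories=[0, 1, 2, 5])
for row in rows:
    print(*row)
```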

In the default configuration, missing categories are ignored: an entry whose value matches none of the categories gets a one in every generated column. To enforce that every entry belongs to a category, set the check_for_missing_categories flag to true when constructing a OneCold transform. If a value is then found that does not belong to any category, a NoMatchingCategoryError is raised. The check requires the dataframe to be materialised and can therefore slow down the transform.

If you want to enable this check, create the transform as follows:

transform = OneCold(
    feature_categories={"feature": [0, 1, 2, 5]},
    check_for_missing_categories=True,
)
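
To illustrate what the check looks for, the following plain-Python sketch flags entries whose one-cold row consists only of ones, meaning that no category matched. The function `find_unmatched` is a hypothetical helper and not part of the library, which raises a NoMatchingCategoryError rather than returning indices.

```python
def find_unmatched(values: list[int], categories: list[int]) -> list[int]:
    """Return the indices of entries that match none of the categories."""
    unmatched = []
    for index, value in enumerate(values):
        row = [int(value != category) for category in categories]
        if all(row):  # every one-cold column is 1: no category matched
            unmatched.append(index)
    return unmatched


# The value 7 is not among the categories, so its index is reported.
unmatched = find_unmatched([0, 1, 7, 5], categories=[0, 1, 2, 5])
print(unmatched)
```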

Initializes the One-Cold transform.

Parameters:

Name Type Description Default
feature_categories dict[str, list[Any]]

Dictionary of features and a list of categorical values to encode for each.

required
check_for_missing_categories bool

If set to true, a check is performed to see if all values belong to a category. If an unknown value is found which does not belong to any category, a NoMatchingCategoryError is thrown. To perform this check, the dataframe must be materialised, resulting in a potential performance decrease. Therefore it defaults to false.

False
Source code in src/flowcean/transforms/one_cold.py
def __init__(
    self,
    feature_categories: dict[str, list[Any]],
    *,
    check_for_missing_categories: bool = False,
) -> None:
    """Initializes the One-Cold transform.

    Args:
        feature_categories: Dictionary of features and a list of
            categorical values to encode for each.
        check_for_missing_categories: If set to true, a check is performed
            to see if all values belong to a category. If an unknown value
            is found which does not belong to any category, a
            NoMatchingCategoryError is thrown. To perform this check, the
            dataframe must be materialised, resulting in a potential
            performance decrease. Therefore it defaults to false.
    """
    self.feature_category_mapping = {
        feature: {f"{feature}_{value}": value for value in values}
        for feature, values in feature_categories.items()
    }
    self.check_for_missing_categories = check_for_missing_categories
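
For illustration, the dictionary comprehension from the constructor can be run on its own to see the mapping from generated column names to category values. The input reuses the example categories from the class description; the standalone variables are only for demonstration.

```python
feature_categories = {"feature": [0, 1, 2, 5]}

# Same comprehension as in __init__: one output column name per category value.
feature_category_mapping = {
    feature: {f"{feature}_{value}": value for value in values}
    for feature, values in feature_categories.items()
}

print(feature_category_mapping)
```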

apply(data)

Transform data with this one-cold transformation.

Each configured feature is replaced by its one-cold columns, and the resulting dataframe is returned.

Parameters:

Name Type Description Default
data LazyFrame

The data to transform.

required

Returns:

Type Description
LazyFrame

The transformed data.

Source code in src/flowcean/transforms/one_cold.py
@override
def apply(
    self,
    data: pl.LazyFrame,
) -> pl.LazyFrame:
    """Transform data with this one-cold transformation.

    Transform data with this one-cold transformation and return the
    resulting dataframe.

    Args:
        data: The data to transform.

    Returns:
        The transformed data.
    """
    if len(self.feature_category_mapping) == 0:
        raise NoCategoriesError
    for (
        feature,
        category_mappings,
    ) in self.feature_category_mapping.items():
        data = data.with_columns(
            [
                pl.col(feature).ne(value).cast(pl.Int64).alias(name)
                for name, value in category_mappings.items()
            ],
        ).drop(feature)

        # Check only for missing categories if the user has requested it
        if self.check_for_missing_categories and (
            data.select(
                [
                    pl.col(name).cast(pl.Boolean)
                    for name in category_mappings
                ],
            )  # Get the newly created one-cold feature columns
            .select(
                # Check if all one-cold features of an entry are true.
                # That's only the case if the category is missing
                pl.all_horizontal(
                    pl.all(),
                ).any(),  # Combine the results for all data entries ...
            )
            .collect(streaming=True)
            # ... and get the final result.
            # If it is true, at least one entry has a missing category
            .item(0, 0)
        ):
            raise NoMatchingCategoryError
    return data

from_dataframe(data, features, *, check_for_missing_categories=False) classmethod

Creates a new one-cold transformation based on sample data.

Parameters:

Name Type Description Default
data DataFrame

A dataframe containing sample data for determining the categories of the transform.

required
features Iterable[str]

Names of the features for which the one-cold transformation will determine the categories.

required
check_for_missing_categories bool

If set to true, a check is performed to see if all values belong to a category. If an unknown value is found which does not belong to any category, a NoMatchingCategoryError is thrown. To perform this check, the dataframe must be materialised, resulting in a potential performance decrease. Therefore it defaults to false.

False
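
A rough plain-Python sketch of the category derivation, assuming the sample data is given as a dictionary of columns rather than a Polars dataframe. The helper `derive_categories` is hypothetical. The actual method relies on Polars' `unique()`, whose output order is not guaranteed by default, so the sketch sorts the values only to get a reproducible result.

```python
def derive_categories(
    columns: dict[str, list[int]],
    features: list[str],
) -> dict[str, list[int]]:
    """Collect the distinct values of each requested feature."""
    return {feature: sorted(set(columns[feature])) for feature in features}


# Only the requested feature is inspected; "other" is left untouched.
sample = {"feature": [0, 1, 2, 1, 5], "other": [3, 3, 4, 4, 3]}
categories = derive_categories(sample, ["feature"])
print(categories)
```

The resulting dictionary has the same shape as the feature_categories argument of the constructor.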
Source code in src/flowcean/transforms/one_cold.py
@classmethod
def from_dataframe(
    cls,
    data: pl.DataFrame,
    features: Iterable[str],
    *,
    check_for_missing_categories: bool = False,
) -> Self:
    """Creates a new one-cold transformation based on sample data.

    Args:
        data: A dataframe containing sample data for determining the
            categories of the transform.
        features: Names of the features for which the one-cold
            transformation will determine the categories.
        check_for_missing_categories: If set to true, a check is performed
            to see if all values belong to a category. If an unknown value
            is found which does not belong to any category, a
            NoMatchingCategoryError is thrown. To perform this check, the
            dataframe must be materialised, resulting in a potential
            performance decrease. Therefore it defaults to false.
    """
    # Derive categories from the data frame
    feature_categories: dict[str, list[Any]] = {}
    for feature in features:
        if data.schema[feature].is_float():
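            # Adjacent string literals are concatenated into one message;
            # separating them with commas would pass a tuple instead.
            logger.warning(
                "Feature %s is of type float. Applying a one-cold "
                "transform to it may produce undesired results. "
                "Check your datatypes and transforms.",
                feature,
            )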
        feature_categories[feature] = (
            data.select(pl.col(feature).unique()).to_series().to_list()
        )
    return cls(
        feature_categories,
        check_for_missing_categories=check_for_missing_categories,
    )