Data Splits

In most data science workflows, it's common to split data into subsets (for example, train and test) for analysis and comparison. To support this, DataInterface subclasses let you attach one or more DataSplit objects that describe the split logic, and then apply them by calling split_data().

Split types

Column Name and Value

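  • Split data based on a column value.
  • Supports an optional inequality sign (for example, "<"). When inequality is omitted, rows where the column equals column_value are selected.
  • Works with Pandas and Polars DataFrames.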

Example

import polars as pl
from opsml import PolarsData, DataSplit, CardInfo

info = CardInfo(name="data", repository="mlops", contact="user@mlops.com")

df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5, 6],
        "bar": ["a", "b", "c", "d", "e", "f"],
        "y": [1, 2, 3, 4, 5, 6],
    }
)

interface = PolarsData(
    info=info,
    data=df,
    data_splits=[
        # "train": rows where foo < 6
        DataSplit(label="train", column_name="foo", column_value=6, inequality="<"),
        # "test": rows where foo == 6 (no inequality means equality)
        DataSplit(label="test", column_name="foo", column_value=6),
    ],
)

splits = interface.split_data()
assert splits["train"].X.shape[0] == 5
assert splits["test"].X.shape[0] == 1
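
To make the column-value logic concrete, here is a minimal sketch in plain Python of how an inequality string could be mapped onto a row filter. It is an illustration only, not opsml's implementation: the helper `filter_rows`, the `OPERATORS` table, and every operator other than "<" and the equality default are assumptions made for this example.

```python
import operator
from typing import Any, Callable, Dict, List, Optional

# Hypothetical lookup table; only "<" and the equality default
# appear in the example above.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def filter_rows(
    rows: List[Dict[str, Any]],
    column_name: str,
    column_value: Any,
    inequality: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Keep rows whose column compares true against column_value."""
    # No inequality means an equality match, mirroring the "test" split above.
    compare = operator.eq if inequality is None else OPERATORS[inequality.strip()]
    return [row for row in rows if compare(row[column_name], column_value)]


# Same shape of data as the Polars example: foo runs from 1 to 6.
rows = [{"foo": i, "bar": letter} for i, letter in enumerate("abcdef", start=1)]

train = filter_rows(rows, "foo", 6, inequality="<")
test = filter_rows(rows, "foo", 6)
```

With this data, `train` holds the five rows where foo is below 6 and `test` holds the single row where foo equals 6, matching the assertions in the example.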

Indices

  • Split data based on pre-defined indices.
  • Works with NDArray, pyarrow.Table, pandas.DataFrame and polars.DataFrame.

Example

import numpy as np
from opsml import NumpyData, DataSplit, CardInfo

info = CardInfo(name="data", repository="mlops", contact="user@mlops.com")

data = np.random.rand(10, 10)

interface = NumpyData(
    info=info,
    data=data,
    data_splits=[
        # "train": rows 0, 1 and 5
        DataSplit(label="train", indices=[0, 1, 5]),
    ],
)

splits = interface.split_data()
assert splits["train"].X.shape[0] == 3
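
The example above defines only a train split. A common follow-up is to hold out every remaining row as a test split. The sketch below computes the complementary indices with the standard library; `n_rows`, `train_indices`, and `test_indices` are illustrative names, and the resulting lists are intended to be passed as the indices argument of two DataSplit objects.

```python
n_rows = 10  # matches np.random.rand(10, 10) in the example above
train_indices = [0, 1, 5]

# Every row index not used for training goes to the test split.
train_set = set(train_indices)
test_indices = [i for i in range(n_rows) if i not in train_set]

# These lists could then be used as, for example:
#   DataSplit(label="train", indices=train_indices)
#   DataSplit(label="test", indices=test_indices)
```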

Start and Stop Slicing

  • Split data based on row slices with a start and stop index.
  • Works with NDArray, pyarrow.Table, pandas.DataFrame and polars.DataFrame.

Example

import numpy as np
from opsml import NumpyData, DataSplit, CardInfo

info = CardInfo(name="data", repository="mlops", contact="user@mlops.com")

data = np.random.rand(10, 10)

interface = NumpyData(
    info=info,
    data=data,
    data_splits=[
        # "train": rows 0, 1 and 2 (start=0, stop=3 yields three rows)
        DataSplit(label="train", start=0, stop=3),
    ],
)

splits = interface.split_data()
assert splits["train"].X.shape[0] == 3
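
Because start=0 and stop=3 yield three rows in the example above, the stop index appears to be exclusive, like a Python slice. Under that assumption, contiguous train and test ranges can be derived from a row count. The helper `split_bounds` below is hypothetical and not part of opsml.

```python
from typing import Dict, Tuple


def split_bounds(n_rows: int, n_train: int) -> Dict[str, Tuple[int, int]]:
    """Return (start, stop) pairs for contiguous train and test ranges."""
    if not 0 <= n_train <= n_rows:
        raise ValueError("n_train must be between 0 and n_rows")
    # stop is treated as exclusive, so the two ranges do not overlap.
    return {"train": (0, n_train), "test": (n_train, n_rows)}


bounds = split_bounds(n_rows=10, n_train=7)

# Each pair could then be used as, for example:
#   DataSplit(label="train", start=bounds["train"][0], stop=bounds["train"][1])
#   DataSplit(label="test", start=bounds["test"][0], stop=bounds["test"][1])
```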

opsml.DataSplit

Bases: BaseModel

Source code in opsml/data/splitter.py
class DataSplit(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    label: str
    column_name: Optional[str] = None
    column_value: Optional[Union[str, float, int, pd.Timestamp]] = None
    inequality: Optional[str] = None
    start: Optional[int] = None
    stop: Optional[int] = None
    indices: Optional[List[int]] = None

    @field_validator("indices", mode="before")
    @classmethod
    def convert_to_list(cls, value: Optional[List[int]]) -> Optional[List[int]]:
        """Pre to convert indices to list if not None"""

        if value is not None and not isinstance(value, list):
            value = list(value)

        return value

    @field_validator("inequality", mode="before")
    @classmethod
    def trim_whitespace(cls, value: str) -> str:
        """Trims whitespace from inequality signs"""

        if value is not None:
            value = value.strip()

        return value

convert_to_list(value) classmethod

Pre-validator that converts indices to a list when the value is not None.

Source code in opsml/data/splitter.py
@field_validator("indices", mode="before")
@classmethod
def convert_to_list(cls, value: Optional[List[int]]) -> Optional[List[int]]:
    """Pre to convert indices to list if not None"""

    if value is not None and not isinstance(value, list):
        value = list(value)

    return value

trim_whitespace(value) classmethod

Trims whitespace from inequality signs

Source code in opsml/data/splitter.py
@field_validator("inequality", mode="before")
@classmethod
def trim_whitespace(cls, value: str) -> str:
    """Trims whitespace from inequality signs"""

    if value is not None:
        value = value.strip()

    return value
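
To illustrate what the two validators do to their inputs, the sketch below reproduces their logic as plain functions, without pydantic. The standalone functions are for demonstration only; in opsml the same logic runs as "before" validators on the indices and inequality fields when a DataSplit is constructed.

```python
from typing import Iterable, List, Optional


def convert_to_list(value: Optional[Iterable[int]]) -> Optional[List[int]]:
    """Mirror of the indices validator: coerce a non-list iterable to a list."""
    if value is not None and not isinstance(value, list):
        value = list(value)
    return value


def trim_whitespace(value: Optional[str]) -> Optional[str]:
    """Mirror of the inequality validator: strip surrounding whitespace."""
    if value is not None:
        value = value.strip()
    return value


from_range = convert_to_list(range(3))   # a range becomes a list
from_tuple = convert_to_list((0, 1, 5))  # so does a tuple
unchanged = convert_to_list(None)        # None passes through untouched
cleaned = trim_whitespace(" <= ")        # a padded sign is normalized
```

In practice this means indices can be supplied as any iterable of integers, and an inequality sign written with stray spaces is normalized before it is stored.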