This repository was archived by the owner on Oct 19, 2025. It is now read-only.

Function to identify variable that can be best predicted from a set of base variables #38

@MaxGhenis

This would help define the sequence in which variables are imputed or synthesized. Something like this would fit well in other functions:

def most_predictable(df, base_cols, candidate_cols, algorithm):
    """ Identifies the most predictable column from a set of base columns.
    
    Args:
        df: DataFrame with base and candidate columns.
        base_cols: List of column names to predict from.
        candidate_cols: List of column names to compare on predictability given base_cols.
        algorithm: Algorithm for determining predictability.

    Returns:
        Column name from candidate_cols which is most predictable from base_cols.
    """

Predictability could be measured with something simple like correlations, or with algorithms like random forests (after standardizing the data; the standardization technique might be another arg).
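To make the proposal concrete, here is a rough sketch of the correlation-style option. It is only one possible reading of the signature above, not a settled design. As assumptions, `algorithm` is a callable that scores how well the base columns predict one candidate, and the default scorer `r_squared` (a hypothetical helper) is the in-sample R-squared of an ordinary least squares fit, which equals the squared multiple correlation. A random forest scorer could be passed in the same way.

```python
import numpy as np
import pandas as pd


def r_squared(X, y):
    """In-sample R-squared of an OLS fit of y on X plus an intercept.

    Hypothetical default scorer; not part of the original proposal.
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = ((y - A @ coef) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    if ss_tot == 0:
        # A constant column has no variance to explain, so R-squared is undefined.
        return float("nan")
    return 1.0 - ss_res / ss_tot


def most_predictable(df, base_cols, candidate_cols, algorithm=r_squared):
    """Return the candidate column with the highest predictability score.

    Assumes numeric columns with no missing values, and that `algorithm`
    is a callable taking (X, y) arrays and returning a score where
    higher means more predictable.
    """
    X = df[base_cols].to_numpy(dtype=float)
    scores = {
        col: algorithm(X, df[col].to_numpy(dtype=float))
        for col in candidate_cols
    }
    # idxmax skips NaN scores (e.g. constant columns) by default.
    return pd.Series(scores).idxmax()
```

With this default no standardization is needed, since OLS R-squared with an intercept does not change when columns are rescaled. The score is computed in-sample, so it will be optimistic; a cross-validated scorer would probably be a better basis for choosing an imputation order.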

cc @rickecon, per our chat if you can take a stab at this that'd be awesome.
