cuisto.compute

compute module, part of cuisto.

Contains actual computation functions.

get_distribution(df, col, hue, hue_filter, per_commonnorm, binlim, nbins=100)

Computes distribution of objects.

A global distribution using only col is computed, then a distribution distinguishing values in the hue column is computed. For the latter, it is possible to use only a subset of the data, selected on another column with hue_filter. That other column is determined by hue: if the latter is "hemisphere", then hue_filter applies to the "channel" column, and vice-versa. per_commonnorm controls how they are normalized, either as a whole (True) or independently (False).

Use cases: (1) single channel, two hemispheres: col=x, hue=hemisphere, hue_filter="", per_commonnorm=True. Computes a distribution for each hemisphere; the areas of both sum to 1. (2) three channels, one hemisphere: col=x, hue=channel, hue_filter="Ipsi.", per_commonnorm=False. Computes a distribution for each channel, using only points in the ipsilateral hemisphere. Each curve will have an area of 1. Those two use cases translate to the calls sketched below.
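A minimal sketch of those two calls, assuming a DataFrame `df` with "x", "hemisphere" and "channel" columns; the `binlim` values are placeholders:

```python
from cuisto.compute import get_distribution

# (1) one channel, two hemispheres, common normalization
dist = get_distribution(
    df, col="x", hue="hemisphere", hue_filter="",
    per_commonnorm=True, binlim=(0, 1),
)

# (2) three channels, ipsilateral hemisphere only, independent normalization
dist = get_distribution(
    df, col="x", hue="channel", hue_filter="Ipsi.",
    per_commonnorm=False, binlim=(0, 1),
)
```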

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` |  | required |
| `col` | `str` | Key in `df`, used to compute the distributions. | required |
| `hue` | `str` | Key in `df`. Criterion for additional distributions. | required |
| `hue_filter` | `str` | Further filtering for the "per" distribution. If `hue` is "channel", the value is the name of one of the hemispheres; if `hue` is "hemisphere", the value can be the name of a channel, a list of such names, or "all". | required |
| `per_commonnorm` | `bool` | Use common normalization for all hues (the per-`hue` distributions). | required |
| `binlim` | `list` or `tuple` | First bin left edge and last bin right edge. | required |
| `nbins` | `int` | Number of bins. Default is 100. | `100` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `df_distribution` | `DataFrame` | DataFrame with `bins`, `distribution`, `count` and their per-hemisphere or per-channel variants. |

Source code in cuisto/compute.py
def get_distribution(
    df: pd.DataFrame,
    col: str,
    hue: str,
    hue_filter: dict,
    per_commonnorm: bool,
    binlim: tuple | list,
    nbins=100,
) -> pd.DataFrame:
    """
    Computes distribution of objects.

    A global distribution using only `col` is computed, then a distribution
    distinguishing values in the `hue` column is computed. For the latter, it is
    possible to use only a subset of the data, selected on another column with
    `hue_filter`. That other column is determined by `hue`: if the latter is
    "hemisphere", then `hue_filter` applies to the "channel" column, and vice-versa.
    `per_commonnorm` controls how they are normalized, either as a whole (True) or
    independently (False).

    Use cases:
    (1) single channel, two hemispheres: `col=x`, `hue=hemisphere`, `hue_filter=""`,
    `per_commonnorm=True`. Computes a distribution for each hemisphere; the areas
    of both sum to 1.
    (2) three channels, one hemisphere: `col=x`, `hue=channel`,
    `hue_filter="Ipsi."`, `per_commonnorm=False`. Computes a distribution for each
    channel, using only points in the ipsilateral hemisphere. Each curve will have
    an area of 1.

    Parameters
    ----------
    df : pandas.DataFrame
    col : str
        Key in `df`, used to compute the distributions.
    hue : str
        Key in `df`. Criterion for additional distributions.
    hue_filter : str
        Further filtering for "per" distribution.
        - hue = channel : value is the name of one of the hemispheres
        - hue = hemisphere : value can be the name of a channel, a list of such or "all"
    per_commonnorm : bool
        Use common normalization for all hues (the per-`hue` distributions).
    binlim : list or tuple
        First bin left edge and last bin right edge.
    nbins : int, optional
        Number of bins. Default is 100.

    Returns
    -------
    df_distribution : pandas.DataFrame
        DataFrame with `bins`, `distribution`, `count` and their per-hemisphere or
        per-channel variants.

    """

    # - Preparation
    bin_edges = np.linspace(*binlim, nbins + 1)  # create bins
    df_distribution = []  # prepare list of distributions

    # - Both hemispheres, all channels
    # get raw count per bins (histogram)
    count, bin_edges = np.histogram(df[col], bin_edges)
    # get normalized count (pdf)
    distribution, _ = np.histogram(df[col], bin_edges, density=True)
    # get bin centers rather than edges to plot them
    bin_centers = bin_edges[:-1] + np.diff(bin_edges) / 2

    # make a DataFrame out of that
    df_distribution.append(
        pd.DataFrame(
            {
                "bins": bin_centers,
                "distribution": distribution,
                "count": count,
                "hemisphere": "both",
                "channel": "all",
                "axis": col,  # keep track of what col. was used
            }
        )
    )

    # - Per additional criterion
    # select data
    df_sub = select_hemisphere_channel(df, hue, hue_filter, False)
    hue_values = df[hue].unique()  # get grouping values
    # total number of datapoints in the subset used for additional distribution
    length_total = len(df_sub)

    for value in hue_values:
        # select part and coordinates
        df_part = df_sub.loc[df_sub[hue] == value, col]

        # get raw count per bins (histogram)
        count, bin_edges = np.histogram(df_part, bin_edges)
        # get normalized count (pdf)
        distribution, _ = np.histogram(df_part, bin_edges, density=True)

        if per_commonnorm:
            # re-normalize so that the sum of areas of all sub-parts is 1
            length_part = len(df_part)  # number of datapoints in that hemisphere
            distribution *= length_part / length_total

        # get bin centers rather than edges to plot them
        bin_centers = bin_edges[:-1] + np.diff(bin_edges) / 2

        # make a DataFrame out of that
        df_distribution.append(
            pd.DataFrame(
                {
                    "bins": bin_centers,
                    "distribution": distribution,
                    "count": count,
                    hue: value,
                    "channel" if hue == "hemisphere" else "hemisphere": hue_filter,
                    "axis": col,  # keep track of what col. was used
                }
            )
        )

    return pd.concat(df_distribution)
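As a usage sketch, the following builds toy data and checks that, with `per_commonnorm=True`, the per-hemisphere areas sum to 1; the column values, channel name and bin settings are made up, and `cuisto` is assumed to be installed:

```python
import numpy as np
import pandas as pd

from cuisto.compute import get_distribution

# toy data: 1000 points in [0, 1], randomly assigned to two hemispheres
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "x": rng.random(1000),
        "hemisphere": rng.choice(["Ipsi.", "Contra."], 1000),
        "channel": "marker",  # single, hypothetical channel name
    }
)

dist = get_distribution(df, "x", "hemisphere", "all", True, (0, 1), nbins=50)

# drop the global ("both") distribution, keep the per-hemisphere ones
per_hemi = dist[dist["hemisphere"] != "both"]
# each density integrates to length_part / length_total, so the areas sum to 1
print(per_hemi["distribution"].sum() * (1 / 50))  # ~1.0
```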

get_regions_metrics(df_annotations, object_type, channel_names, meas_base_name, metrics_names)

Get a new DataFrame with the cumulated length of axon segments in each brain region.

This is the quantification per brain region for fiber-like objects, e.g. axons. The returned DataFrame has columns "cum. length µm", "cum. length mm", "density µm^-1", "density mm^-1", "coverage index".

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df_annotations` | `DataFrame` | DataFrame with an entry for each brain region, with columns "Area µm^2", "Name", "hemisphere", and "{object_type: channel} Length µm". | required |
| `object_type` | `str` | Object type (primary classification). | required |
| `channel_names` | `dict` | Mapping from original channel names to something else. | required |
| `meas_base_name` | `str` |  | required |
| `metrics_names` | `dict` |  | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `df_regions` | `DataFrame` | DataFrame with brain regions names, areas and metrics. |

Source code in cuisto/compute.py
def get_regions_metrics(
    df_annotations: pd.DataFrame,
    object_type: str,
    channel_names: dict,
    meas_base_name: str,
    metrics_names: dict,
) -> pd.DataFrame:
    """
    Get a new DataFrame with the cumulated length of axon segments in each brain
    region.

    This is the quantification per brain region for fiber-like objects, e.g. axons.
    The returned DataFrame has columns "cum. length µm", "cum. length mm",
    "density µm^-1", "density mm^-1", "coverage index".

    Parameters
    ----------
    df_annotations : pandas.DataFrame
        DataFrame with an entry for each brain region, with columns "Area µm^2",
        "Name", "hemisphere", and "{object_type: channel} Length µm".
    object_type : str
        Object type (primary classification).
    channel_names : dict
        Mapping from original channel names to something else.
    meas_base_name : str
    metrics_names : dict

    Returns
    -------
    df_regions : pandas.DataFrame
        DataFrame with brain regions name, area and metrics.

    """
    # get columns names
    cols = df_annotations.columns
    # get columns with fibers lengths
    cols_colors = cols[
        cols.str.startswith(object_type) & cols.str.endswith(meas_base_name)
    ]
    # select relevant data
    cols_to_select = pd.Index(["Name", "hemisphere", "Area µm^2"]).append(cols_colors)
    # sum lengths and areas of each brain regions
    df_regions = (
        df_annotations[cols_to_select]
        .groupby(["Name", "hemisphere"])
        .sum()
        .reset_index()
    )

    # get measurement for both hemispheres (sum)
    df_both = df_annotations[cols_to_select].groupby(["Name"]).sum().reset_index()
    df_both["hemisphere"] = "both"
    df_regions = (
        pd.concat([df_regions, df_both], ignore_index=True)
        .sort_values(by="Name")
        .reset_index()
        .drop(columns="index")
    )

    # rename measurement columns to lower case
    df_regions = df_regions.rename(
        columns={
            k: k.replace(meas_base_name, meas_base_name.lower()) for k in cols_colors
        }
    )

    # update names
    meas_base_name = meas_base_name.lower()
    cols = df_regions.columns
    cols_colors = cols[
        cols.str.startswith(object_type) & cols.str.endswith(meas_base_name)
    ]

    # convert area in mm^2
    df_regions["Area mm^2"] = df_regions["Area µm^2"] / 1e6

    # prepare metrics
    if "µm" in meas_base_name:
        # fibers : convert to mm
        cols_to_convert = pd.Index([col for col in cols_colors if "µm" in col])
        df_regions[cols_to_convert.str.replace("µm", "mm")] = (
            df_regions[cols_to_convert] / 1000
        )
        metrics = [meas_base_name, meas_base_name.replace("µm", "mm")]
    else:
        # objects : count
        metrics = [meas_base_name]

    # density = measurement / area
    metric = metrics_names["density µm^-2"]
    df_regions[cols_colors.str.replace(meas_base_name, metric)] = df_regions[
        cols_colors
    ].divide(df_regions["Area µm^2"], axis=0)
    metrics.append(metric)
    metric = metrics_names["density mm^-2"]
    df_regions[cols_colors.str.replace(meas_base_name, metric)] = df_regions[
        cols_colors
    ].divide(df_regions["Area mm^2"], axis=0)
    metrics.append(metric)

    # coverage index = measurement² / area
    metric = metrics_names["coverage index"]
    df_regions[cols_colors.str.replace(meas_base_name, metric)] = (
        df_regions[cols_colors].pow(2).divide(df_regions["Area µm^2"], axis=0)
    )
    metrics.append(metric)

    # prepare relative metrics columns
    metric = metrics_names["relative measurement"]
    cols_rel_meas = cols_colors.str.replace(meas_base_name, metric)
    df_regions[cols_rel_meas] = np.nan
    metrics.append(metric)
    metric = metrics_names["relative density"]
    cols_dens = cols_colors.str.replace(meas_base_name, metrics_names["density mm^-2"])
    cols_rel_dens = cols_colors.str.replace(meas_base_name, metric)
    df_regions[cols_rel_dens] = np.nan
    metrics.append(metric)
    # relative metrics should be defined within each hemispheres (left, right, both)
    for hemisphere in df_regions["hemisphere"].unique():
        row_indexer = df_regions["hemisphere"] == hemisphere

        # relative measurement = measurement / total measurement
        df_regions.loc[row_indexer, cols_rel_meas] = (
            df_regions.loc[row_indexer, cols_colors]
            .divide(df_regions.loc[row_indexer, cols_colors].sum())
            .to_numpy()
        )

        # relative density = density / total density
        df_regions.loc[row_indexer, cols_rel_dens] = (
            df_regions.loc[
                row_indexer,
                cols_dens,
            ]
            .divide(df_regions.loc[row_indexer, cols_dens].sum())
            .to_numpy()
        )

    # collect channel names
    channels = (
        cols_colors.str.replace(object_type + ": ", "")
        .str.replace(" " + meas_base_name, "")
        .values.tolist()
    )
    # collect measurements columns names
    cols_metrics = df_regions.columns.difference(
        pd.Index(["Name", "hemisphere", "Area µm^2", "Area mm^2"])
    )
    for metric in metrics:
        cols_to_cat = [f"{object_type}: {cn} {metric}" for cn in channels]
        # make sure it's part of available metrics
        if not set(cols_to_cat) <= set(cols_metrics):
            raise ValueError(f"{cols_to_cat} not in DataFrame.")
        # group all colors in the same colors
        df_regions[metric] = df_regions[cols_to_cat].values.tolist()
        # remove original data
        df_regions = df_regions.drop(columns=cols_to_cat)

    # add a color tag, given their names in the configuration file
    df_regions["channel"] = len(df_regions) * [[channel_names[k] for k in channels]]
    metrics.append("channel")

    # explode the dataframe so that each color has an entry
    df_regions = df_regions.explode(metrics)

    return df_regions
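A hypothetical call, with made-up regions, a single "EGFP" channel, measurement columns following the "{object_type}: {channel} {meas_base_name}" pattern, and illustrative metric display names:

```python
import pandas as pd

from cuisto.compute import get_regions_metrics

# made-up per-annotation measurements: two regions, one channel
df_annotations = pd.DataFrame(
    {
        "Name": ["GRN", "GRN", "IRN", "IRN"],
        "hemisphere": ["Left", "Right", "Left", "Right"],
        "Area µm^2": [1.2e6, 1.1e6, 2.0e6, 2.1e6],
        "Fibers: EGFP Length µm": [1500.0, 1800.0, 300.0, 250.0],
    }
)
# display names for the derived metrics (illustrative choices)
metrics_names = {
    "density µm^-2": "density µm^-1",
    "density mm^-2": "density mm^-1",
    "coverage index": "coverage index",
    "relative measurement": "relative length µm",
    "relative density": "relative density",
}
df_regions = get_regions_metrics(
    df_annotations, "Fibers", {"EGFP": "positive"}, "Length µm", metrics_names
)
# df_regions has one row per region, hemisphere (Left, Right, both) and channel
```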

normalize_starter_cells(df, cols, animal, info_file, channel_names)

Normalize data by the number of starter cells.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | Contains the data to be normalized. | required |
| `cols` | list-like | Columns to divide by the number of starter cells. | required |
| `animal` | `str` | Animal ID to parse the number of starter cells. | required |
| `info_file` | `str` | Full path to the TOML file with information. | required |
| `channel_names` | `dict` | Mapping from original channel names to something else. | required |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | Same `df` with normalized counts. |

Source code in cuisto/compute.py
def normalize_starter_cells(
    df: pd.DataFrame, cols: list[str], animal: str, info_file: str, channel_names: dict
) -> pd.DataFrame:
    """
    Normalize data by the number of starter cells.

    Parameters
    ----------
    df : pd.DataFrame
        Contains the data to be normalized.
    cols : list-like
        Columns to divide by the number of starter cells.
    animal : str
        Animal ID to parse the number of starter cells.
    info_file : str
        Full path to the TOML file with information.
    channel_names : dict
        Mapping from original channel names to something else.

    Returns
    -------
    pd.DataFrame
        Same `df` with normalized counts.

    """
    for channel in df["channel"].unique():
        # inverse mapping channel colors : names
        reverse_channels = {v: k for k, v in channel_names.items()}
        nstarters = get_starter_cells(animal, reverse_channels[channel], info_file)

        for col in cols:
            df.loc[df["channel"] == channel, col] = (
                df.loc[df["channel"] == channel, col] / nstarters
            )

    return df
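A hedged usage sketch; the animal ID, TOML path, column list and channel mapping are placeholders, and `df_regions` is expected to carry a "channel" column holding the mapped names:

```python
from cuisto.compute import normalize_starter_cells

df_norm = normalize_starter_cells(
    df_regions,                          # e.g. output of get_regions_metrics
    cols=["length µm", "length mm"],     # hypothetical measurement columns
    animal="mouse0",                     # placeholder animal ID
    info_file="/path/to/infos.toml",     # placeholder TOML file
    channel_names={"EGFP": "positive"},  # same mapping as used upstream
)
```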