unravel.cluster_stats.cstats module

Use cstats (cs) from UNRAVEL to validate clusters based on differences in validation metrics across groups.

Input files:
  • *_data.csv from cstats_validation (e.g., cell_density_data.csv, label_density_data.csv, mean_in_cluster_data.csv, or mean_in_seg_in_cluster_data.csv)

Outputs:
  • ./_valid_clusters_stats/

Note

  • Organize data in directories for each comparison (e.g., psilocybin > saline, etc.)

  • This script loops through all subdirectories in the current working directory.

  • Each subdir should contain CSV files with cluster-level validation metric data.

  • The first 2 groups reflect the main comparison for validation rates.

  • If --higher_group is provided, validation is directional: clusters whose effect direction does not match the expected direction are not considered valid.

  • If --higher_group is omitted, validation is non-directional and significant clusters are kept regardless of effect direction.
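The two --higher_group rules above can be sketched as a small predicate. This is an illustration only, not UNRAVEL's code: the function name `is_valid_cluster`, its parameters, and the `p_value < alpha` significance criterion are assumptions made for the example.

```python
def is_valid_cluster(p_value, higher_mean_group, alpha, higher_group=None):
    """Hypothetical helper: decide whether one cluster counts as valid.

    alpha is assumed to be the significance threshold for the comparison.
    """
    significant = p_value < alpha  # assumed significance criterion
    if higher_group is None:
        # Non-directional: keep significant clusters regardless of direction.
        return significant
    # Directional: the group with the higher mean must be the expected one.
    return significant and higher_mean_group == higher_group


# Significant, but the effect goes the "wrong" way for a directional check.
print(is_valid_cluster(0.01, "saline", 0.05, higher_group="psilocybin"))
# Same cluster with no expected direction.
print(is_valid_cluster(0.01, "saline", 0.05))
```

The first call rejects the cluster because saline, not psilocybin, has the higher mean; the second keeps it because no direction was requested.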

Columns in .csv files from cstats_validation:

sample, cluster_ID, metric, value, value_type, support, support_type, aggregation_method, cluster_volume, …

Columns in the older .csv files (still supported):

sample, cluster_ID, <cell_count|label_volume>, cluster_volume, <cell_density|label_density>, …

CSV naming conventions:
  • Condition: first word before ‘_’ in the file name

  • Side: last word before .csv (LH or RH)
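As a rough illustration of these naming conventions, the sketch below extracts the condition and side from a file name. The helper `parse_condition_and_side` is hypothetical and is not part of UNRAVEL; it only mirrors the two rules stated above.

```python
from pathlib import Path


def parse_condition_and_side(csv_path):
    """Hypothetical helper: return (condition, side) from a CSV file name."""
    stem = Path(csv_path).stem       # file name without the ".csv" suffix
    words = stem.split("_")
    condition = words[0]             # first word before "_"
    # Side is the last word before ".csv", but only if it is LH or RH.
    side = words[-1] if words[-1] in ("LH", "RH") else None
    return condition, side


print(parse_condition_and_side("psilocybin_sample01_cell_density_data_LH.csv"))
print(parse_condition_and_side("saline_sample03_cell_density_data.csv"))
```

For a unilateral file name the sketch returns `None` for the side, which is a choice made for this example.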

Example unilateral inputs in the subdirs:
  • condition1_sample01_<cell|label>_density_data.csv

  • condition1_sample02_<cell|label>_density_data.csv

  • condition2_sample03_<cell|label>_density_data.csv

  • condition2_sample04_<cell|label>_density_data.csv

Example bilateral inputs (if any file has _LH.csv or _RH.csv, the command will attempt to pool LH/RH data per sample when both sides are present):
  • condition1_sample01_<cell|label>_density_data_LH.csv

  • condition1_sample01_<cell|label>_density_data_RH.csv

Examples

  • Grouping data by condition prefixes:

    cstats --groups psilocybin saline --condition_prefixes saline psilocybin

      • This will treat all ‘psilocybin*’ conditions as one group and all ‘saline*’ conditions as another

      • Since there will then effectively be two conditions in this case, they will be compared using a t-test

Usage for t-tests:

cstats --groups <group1> <group2> [-hg <group>] [-dp <dir_pattern>] [-cp <condition_prefixes>] [-alt <two-sided|less|greater>] [-pvt <p_value_threshold.txt>] [-v]

Usage for Tukey’s tests:

cstats --groups <group1> <group2> <group3> … [-hg <group>] [-dp <dir_pattern>] [-cp <condition_prefixes>] [-alt <two-sided|less|greater>] [-pvt <p_value_threshold.txt>] [-v]

unravel.cluster_stats.cstats.parse_args()
unravel.cluster_stats.cstats.get_matching_input_csvs(input_dir, groups)
unravel.cluster_stats.cstats.detect_metric_schema(first_df)

Detect whether input CSVs use the new generic metric schema or the older density schema.

Returns:

dict with keys: metric_name, value_col, support_col, support_type, aggregation_method, value_type, schema_type

Return type:

dict

unravel.cluster_stats.cstats.load_metric_csv_for_stats(file, schema)

Load one metric CSV and normalize it to a standard schema for stats.

Returns DataFrame with columns:

condition, sample, side, cluster_ID, value, support, cluster_volume

unravel.cluster_stats.cstats.condition_selector(df, condition, unique_conditions, condition_column='condition')

Create a condition selector to handle pooling of data in a DataFrame based on specified conditions. This function checks if the ‘condition’ is exactly present in the ‘condition’ column or is a prefix of any condition in this column. If the exact condition is found, it selects those rows. If the condition is a prefix (e.g., ‘saline’ matches ‘saline-1’, ‘saline-2’), it selects all rows where the ‘condition’ column starts with this prefix. An error is raised if the condition is neither found as an exact match nor as a prefix.

Parameters:
  • df (pd.DataFrame) – DataFrame whose ‘condition’ column contains the conditions of interest.

  • condition (str) – The condition or prefix of interest.

  • unique_conditions (list) – List of unique conditions in the ‘condition’ column to validate against.

Returns:

A boolean Series to select rows based on the condition.

Return type:

pd.Series
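The exact-match-then-prefix behaviour described for condition_selector can be sketched with pandas as follows. This is a minimal illustration based only on the description above, not the library's implementation; the function name `select_condition` and the error message are assumptions.

```python
import pandas as pd


def select_condition(df, condition, unique_conditions, condition_column="condition"):
    """Hypothetical sketch: boolean mask for an exact or prefix condition match."""
    if condition in unique_conditions:
        # Exact match: select only rows with this condition.
        return df[condition_column] == condition
    if any(str(c).startswith(condition) for c in unique_conditions):
        # Prefix match: pool every condition that starts with the prefix.
        return df[condition_column].str.startswith(condition)
    raise ValueError(f"Condition {condition!r} not found as an exact match or prefix.")


example_df = pd.DataFrame({
    "condition": ["saline-1", "saline-2", "psilocybin"],
    "value": [1.0, 2.0, 3.0],
})
conditions = list(example_df["condition"].unique())

# "saline" is not an exact condition, so it pools "saline-1" and "saline-2".
print(example_df[select_condition(example_df, "saline", conditions)])
```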

unravel.cluster_stats.cstats.pool_sample_metric(sample_df, metric_name, aggregation_method)

Pool LH/RH rows for one sample+condition+cluster if both sides are present.

If only one side is present, return that side unchanged.

unravel.cluster_stats.cstats.cluster_validation_data_df(metric_name, value_col, support_col, support_type, aggregation_method, has_hemisphere, csv_files, groups, condition_prefixes=None)

Aggregate metric data from all CSVs, optionally pool bilateral data per sample, optionally group conditions by prefix, and return a standardized DataFrame.

Returns DataFrame with columns:

condition, sample, side, cluster_ID, support, cluster_volume, value

unravel.cluster_stats.cstats.valid_clusters_t_test(df, group1, group2, value_col='value', alternative='two-sided')

Perform unpaired t-tests for each cluster in the DataFrame and return the results as a DataFrame.

Parameters:
  • df – the DataFrame containing the cluster data. Columns: ‘condition’, ‘sample’, ‘cluster_ID’, value_col, support, cluster_volume

  • group1 – the name of the first group

  • group2 – the name of the second group

  • value_col – the column name for the metric values to compare

  • alternative – the alternative hypothesis (‘two-sided’, ‘less’, or ‘greater’) for the t-test

Returns:

the DataFrame containing the t-test results
  • Columns: ‘cluster_ID’, ‘comparison’, ‘higher_mean_group’, ‘p-value’, ‘significance’

Return type:

  • stats_df (pd.DataFrame)
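A per-cluster unpaired t-test of this kind might look like the sketch below, which uses `scipy.stats.ttest_ind`. It is not the function's actual source: the name `t_test_per_cluster`, the equal-variance default of `ttest_ind`, the comparison label format, and the example data are all assumptions, and the ‘significance’ column is omitted.

```python
import pandas as pd
from scipy.stats import ttest_ind


def t_test_per_cluster(df, group1, group2, value_col="value", alternative="two-sided"):
    """Hypothetical sketch: one unpaired t-test per cluster_ID."""
    rows = []
    for cluster_id, cluster_df in df.groupby("cluster_ID"):
        g1 = cluster_df.loc[cluster_df["condition"] == group1, value_col]
        g2 = cluster_df.loc[cluster_df["condition"] == group2, value_col]
        # Student's t-test (equal variances) is SciPy's default; an assumption here.
        result = ttest_ind(g1, g2, alternative=alternative)
        rows.append({
            "cluster_ID": cluster_id,
            "comparison": f"{group1} vs {group2}",  # label format is illustrative
            "higher_mean_group": group1 if g1.mean() > g2.mean() else group2,
            "p-value": result.pvalue,
        })
    return pd.DataFrame(rows)


# Made-up values for two clusters and two conditions (three samples each).
example_df = pd.DataFrame({
    "condition": ["psilocybin"] * 3 + ["saline"] * 3 + ["psilocybin"] * 3 + ["saline"] * 3,
    "sample": ["s1", "s2", "s3", "s4", "s5", "s6"] * 2,
    "cluster_ID": [1] * 6 + [2] * 6,
    "value": [10.0, 12.0, 11.0, 5.0, 6.0, 4.0, 7.0, 8.0, 9.0, 7.0, 8.0, 9.0],
})
stats_df = t_test_per_cluster(example_df, "psilocybin", "saline")
print(stats_df)
```

In this toy data, cluster 1 differs clearly between groups while cluster 2 has identical values in both, so only cluster 1 would pass a typical significance threshold.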

unravel.cluster_stats.cstats.perform_tukey_test(df, value_col='value')

Perform Tukey’s HSD test for each cluster in the DataFrame and return the results as a DataFrame.
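For three or more groups, a per-cluster Tukey's HSD test could be sketched as below. This example assumes `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later); the library UNRAVEL actually uses is not stated here, and the function name `tukey_per_cluster`, the output columns, and the example data are illustrative.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import tukey_hsd


def tukey_per_cluster(df, value_col="value"):
    """Hypothetical sketch: pairwise Tukey HSD p-values for each cluster_ID."""
    rows = []
    for cluster_id, cluster_df in df.groupby("cluster_ID"):
        conditions = sorted(cluster_df["condition"].unique())
        # One array of values per condition, in the same order as `conditions`.
        samples = [
            cluster_df.loc[cluster_df["condition"] == c, value_col].to_numpy()
            for c in conditions
        ]
        result = tukey_hsd(*samples)
        # result.pvalue[i, j] is the p-value for condition i vs condition j.
        for i, j in combinations(range(len(conditions)), 2):
            rows.append({
                "cluster_ID": cluster_id,
                "comparison": f"{conditions[i]} vs {conditions[j]}",
                "p-value": result.pvalue[i, j],
            })
    return pd.DataFrame(rows)


# Made-up values for one cluster and three conditions.
example_df = pd.DataFrame({
    "condition": ["a"] * 3 + ["b"] * 3 + ["c"] * 3,
    "cluster_ID": [1] * 9,
    "value": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5, 10.0, 11.0, 12.0],
})
tukey_df = tukey_per_cluster(example_df)
print(tukey_df)
```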

unravel.cluster_stats.cstats.main()