unravel.cluster_stats.cstats module
Use cstats (cs) from UNRAVEL to validate clusters based on differences in validation metrics across groups.
- Input files:
*_data.csv from cstats_validation (e.g., cell_density_data.csv, label_density_data.csv, mean_in_cluster_data.csv, or mean_in_seg_in_cluster_data.csv)
- Outputs:
./_valid_clusters_stats/
Note
Organize data in one directory per comparison (e.g., psilocybin > saline).
This script loops through all subdirectories in the current working directory.
Each subdir should contain CSV files with cluster-level validation metric data.
The first 2 groups reflect the main comparison for validation rates.
If --higher_group is provided, a cluster is considered valid only if its effect direction matches the expected direction.
If --higher_group is omitted, validation is non-directional and significant clusters are kept regardless of effect direction.
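The directional rule above can be illustrated with a minimal sketch. The helper name `is_valid_cluster`, the strict `<` comparison, and the default threshold of 0.05 are assumptions made for illustration; the actual command takes its threshold from the file passed with -pvt and may apply the rule differently.

```python
from typing import Optional


def is_valid_cluster(
    p_value: float,
    higher_mean_group: str,
    p_threshold: float = 0.05,  # assumed default, not taken from cstats
    higher_group: Optional[str] = None,
) -> bool:
    """Hypothetical helper: decide whether one cluster passes validation."""
    if p_value >= p_threshold:
        return False  # not significant, so never valid
    if higher_group is None:
        return True  # non-directional: any significant cluster is kept
    # Directional: the group with the higher mean must be the expected one
    return higher_mean_group == higher_group


# Significant, but the effect points the wrong way for a directional check
print(is_valid_cluster(0.01, "saline", higher_group="psilocybin"))
# The same cluster under a non-directional check
print(is_valid_cluster(0.01, "saline"))
```
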
- Columns in .csv files from cstats_validation:
sample, cluster_ID, metric, value, value_type, support, support_type, aggregation_method, cluster_volume, …
- Columns in the older .csv files (still supported):
sample, cluster_ID, <cell_count|label_volume>, cluster_volume, <cell_density|label_density>, …
- CSV naming conventions:
Condition: first word before ‘_’ in the file name
Side: last word before .csv (LH or RH)
- Example unilateral inputs in the subdirs:
condition1_sample01_<cell|label>_density_data.csv
condition1_sample02_<cell|label>_density_data.csv
condition2_sample03_<cell|label>_density_data.csv
condition2_sample04_<cell|label>_density_data.csv
- Example bilateral inputs (if any file has _LH.csv or _RH.csv, the command will attempt to pool LH/RH data per sample when both sides are present):
condition1_sample01_<cell|label>_density_data_LH.csv
condition1_sample01_<cell|label>_density_data_RH.csv
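The naming conventions above can be sketched as a small parser. The function name `parse_condition_and_side` is hypothetical and this is not the parsing code used by cstats; it only follows the two stated rules (condition is the first word before '_', side is the last word before .csv when it is LH or RH).

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_condition_and_side(csv_name: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: extract (condition, side) from a CSV file name."""
    parts = Path(csv_name).stem.split("_")  # stem drops the .csv extension
    condition = parts[0]  # first word before '_'
    # Side is only defined for bilateral files ending in _LH or _RH
    side = parts[-1] if parts[-1] in ("LH", "RH") else None
    return condition, side


print(parse_condition_and_side("condition1_sample01_cell_density_data_LH.csv"))
print(parse_condition_and_side("condition2_sample03_label_density_data.csv"))
```

For unilateral files the side comes back as None, which a caller could use to decide whether LH/RH pooling applies.
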
Examples
- Grouping data by condition prefixes:
cstats --groups psilocybin saline --condition_prefixes saline psilocybin
- This will treat all ‘psilocybin*’ conditions as one group and all ‘saline*’ conditions as another.
- Since there are then effectively two conditions, they are compared using a t-test.
Usage for t-tests:
cstats --groups <group1> <group2> [-hg <group>] [-dp <dir_pattern>] [-cp <condition_prefixes>] [-alt <two-sided|less|greater>] [-pvt <p_value_threshold.txt>] [-v]
Usage for Tukey’s tests:
cstats --groups <group1> <group2> <group3> … [-hg <group>] [-dp <dir_pattern>] [-cp <condition_prefixes>] [-alt <two-sided|less|greater>] [-pvt <p_value_threshold.txt>] [-v]
- unravel.cluster_stats.cstats.detect_metric_schema(first_df)
Detect whether input CSVs use the new generic metric schema or the older density schema.
- Returns:
dict with keys: metric_name, value_col, support_col, support_type, aggregation_method, value_type, schema_type
- Return type:
dict
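As a rough illustration of schema detection, the sketch below inspects the column names of one CSV row and returns a dict with the keys listed above. Everything specific here is an assumption: the function name `detect_schema_sketch`, the `"generic"` and `"legacy"` labels, the use of None for fields the older files do not carry, and the choice of a plain mapping (e.g., a row from csv.DictReader) instead of a DataFrame. The real function may populate these fields differently.

```python
from typing import Dict, Mapping, Optional

# Columns that mark the newer cstats_validation output (per the module notes)
GENERIC_COLS = {"metric", "value", "value_type", "support",
                "support_type", "aggregation_method"}

# Older files: (value column, support column) pairs
LEGACY_PAIRS = (("cell_density", "cell_count"),
                ("label_density", "label_volume"))


def detect_schema_sketch(first_row: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Hypothetical sketch: classify one CSV row as generic or legacy."""
    cols = set(first_row)
    if GENERIC_COLS <= cols:
        return {
            "metric_name": first_row["metric"],
            "value_col": "value",
            "support_col": "support",
            "support_type": first_row["support_type"],
            "aggregation_method": first_row["aggregation_method"],
            "value_type": first_row["value_type"],
            "schema_type": "generic",  # label is an assumption
        }
    for value_col, support_col in LEGACY_PAIRS:
        if value_col in cols and support_col in cols:
            return {
                "metric_name": value_col,
                "value_col": value_col,
                "support_col": support_col,
                "support_type": None,        # not stored in older files
                "aggregation_method": None,  # not stored in older files
                "value_type": None,          # not stored in older files
                "schema_type": "legacy",     # label is an assumption
            }
    raise ValueError(f"Unrecognized columns: {sorted(cols)}")
```
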
- unravel.cluster_stats.cstats.load_metric_csv_for_stats(file, schema)
Load one metric CSV and normalize it to a standard schema for stats.
- Returns DataFrame with columns:
condition, sample, side, cluster_ID, value, support, cluster_volume
- unravel.cluster_stats.cstats.condition_selector(df, condition, unique_conditions, condition_column='condition')
Create a condition selector to handle pooling of data in a DataFrame based on specified conditions. This function checks if the ‘condition’ is exactly present in the ‘condition’ column or is a prefix of any condition in this column. If the exact condition is found, it selects those rows. If the condition is a prefix (e.g., ‘saline’ matches ‘saline-1’, ‘saline-2’), it selects all rows where the ‘condition’ column starts with this prefix. An error is raised if the condition is neither found as an exact match nor as a prefix.
- Parameters:
- Returns:
A boolean Series to select rows based on the condition.
- Return type:
pd.Series
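The exact-match-then-prefix behavior described for this function can be sketched with pandas as follows. The name `select_condition` is hypothetical, and the sketch drops the unique_conditions argument for brevity, so it is an approximation of the documented behavior and not the library's implementation.

```python
import pandas as pd


def select_condition(df: pd.DataFrame, condition: str,
                     condition_column: str = "condition") -> pd.Series:
    """Hypothetical sketch: boolean mask for an exact or prefix match."""
    col = df[condition_column].astype(str)
    exact = col == condition
    if exact.any():
        return exact  # exact matches take precedence over prefix matches
    prefix = col.str.startswith(condition)
    if prefix.any():
        return prefix  # e.g., 'saline' matches 'saline-1' and 'saline-2'
    raise ValueError(f"Condition {condition!r} not found as a match or prefix")


df = pd.DataFrame({"condition": ["saline-1", "saline-2", "psilocybin"],
                   "value": [1.0, 2.0, 3.0]})
# Pool both saline sub-conditions into a single group label
df.loc[select_condition(df, "saline"), "condition"] = "saline"
print(df)
```
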
- unravel.cluster_stats.cstats.pool_sample_metric(sample_df, metric_name, aggregation_method)
Pool LH/RH rows for one sample+condition+cluster if both sides are present.
If only one side is present, return that side unchanged.
- unravel.cluster_stats.cstats.cluster_validation_data_df(metric_name, value_col, support_col, support_type, aggregation_method, has_hemisphere, csv_files, groups, condition_prefixes=None)
Aggregate metric data from all CSVs, optionally pool bilateral data per sample, optionally group conditions by prefix, and return a standardized DataFrame.
- Returns DataFrame with columns:
condition, sample, side, cluster_ID, support, cluster_volume, value
- unravel.cluster_stats.cstats.valid_clusters_t_test(df, group1, group2, value_col='value', alternative='two-sided')
Perform unpaired t-tests for each cluster in the DataFrame and return the results as a DataFrame.
- Parameters:
df – the DataFrame containing the cluster data. Columns: ‘condition’, ‘sample’, ‘cluster_ID’, value_col, support, cluster_volume
group1 – the name of the first group
group2 – the name of the second group
value_col – the column name for the metric values to compare
alternative – the alternative hypothesis (‘two-sided’, ‘less’, or ‘greater’) for the t-test
- Returns:
stats_df – the DataFrame containing the t-test results. Columns: ‘cluster_ID’, ‘comparison’, ‘higher_mean_group’, ‘p-value’, ‘significance’
- Return type:
pd.DataFrame
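A per-cluster unpaired t-test of this kind could look like the sketch below, which uses scipy.stats.ttest_ind. The function name `t_test_per_cluster`, the equal-variance (Student's) test, the format of the ‘comparison’ string, and the ‘*’ / ‘n.s.’ labels at a 0.05 cutoff are all assumptions for illustration; the library's own function may differ on each point.

```python
import pandas as pd
from scipy.stats import ttest_ind


def t_test_per_cluster(df: pd.DataFrame, group1: str, group2: str,
                       value_col: str = "value",
                       alternative: str = "two-sided") -> pd.DataFrame:
    """Hypothetical sketch: one unpaired t-test per cluster_ID."""
    rows = []
    for cluster_id, cluster_df in df.groupby("cluster_ID"):
        g1 = cluster_df.loc[cluster_df["condition"] == group1, value_col]
        g2 = cluster_df.loc[cluster_df["condition"] == group2, value_col]
        # Student's t-test (equal variances assumed); this choice is an assumption
        _, p_value = ttest_ind(g1, g2, alternative=alternative)
        rows.append({
            "cluster_ID": cluster_id,
            "comparison": f"{group1} vs {group2}",  # assumed label format
            "higher_mean_group": group1 if g1.mean() > g2.mean() else group2,
            "p-value": float(p_value),
            "significance": "*" if p_value < 0.05 else "n.s.",  # assumed labels
        })
    return pd.DataFrame(rows)


data = pd.DataFrame({
    "condition": ["psilocybin"] * 3 + ["saline"] * 3,
    "sample": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "cluster_ID": [1] * 6,
    "value": [10.0, 11.0, 12.0, 1.0, 2.0, 3.0],
})
print(t_test_per_cluster(data, "psilocybin", "saline"))
```
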