unravel.cluster_stats.cstats module
Use cstats (cs) from UNRAVEL to validate clusters based on differences in cell/object or label density with t-tests.
- Input files:
*_density_data.csv from cstats_validation (e.g., in each subdir named after the rev_cluster_index.nii.gz file)
- Outputs:
./_valid_clusters_stats/
Note
Organize data in directories for each comparison (e.g., psilocybin > saline, etc.)
This script will loop through all directories in the current working dir and process the data in each subdir.
Each subdir should contain .csv files with the density data for each cluster.
The first 2 groups reflect the main comparison for validation rates.
Clusters are not considered valid if the effect direction does not match the expected direction.
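The validity rule in the note above can be sketched as follows. This is a minimal illustration, not UNRAVEL's implementation: the function name `is_valid_cluster`, the example results, and the 0.05 threshold are all assumptions made for the example.

```python
def is_valid_cluster(p_value, higher_mean_group, expected_higher_group, p_threshold=0.05):
    """Return True if the cluster is significant AND the effect has the expected direction.

    p_threshold=0.05 is an assumed default for this sketch.
    """
    return p_value < p_threshold and higher_mean_group == expected_higher_group


# Hypothetical per-cluster results for a psilocybin vs. saline comparison
results = [
    {"cluster_ID": 1, "p_value": 0.003, "higher_mean_group": "psilocybin"},
    {"cluster_ID": 2, "p_value": 0.020, "higher_mean_group": "saline"},      # wrong direction
    {"cluster_ID": 3, "p_value": 0.300, "higher_mean_group": "psilocybin"},  # not significant
]

# Keep only clusters where psilocybin is significantly higher, as expected
valid_ids = [
    r["cluster_ID"]
    for r in results
    if is_valid_cluster(r["p_value"], r["higher_mean_group"], "psilocybin")
]

# Fraction of clusters that passed validation
validation_rate = len(valid_ids) / len(results)
print(valid_ids, validation_rate)
```

Here cluster 2 is significant but rejected because saline, not psilocybin, has the higher mean.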
- CSV naming conventions:
Condition: first word before ‘_’ in the file name
Side: last word before .csv (LH or RH)
- Example unilateral inputs in the subdirs:
condition1_sample01_<cell|label>_density_data.csv
condition1_sample02_<cell|label>_density_data.csv
condition2_sample03_<cell|label>_density_data.csv
condition2_sample04_<cell|label>_density_data.csv
- Example bilateral inputs (if any file has _LH.csv or _RH.csv, the command will attempt to pool data):
condition1_sample01_<cell|label>_density_data_LH.csv
condition1_sample01_<cell|label>_density_data_RH.csv
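To illustrate the naming conventions above, the following sketch extracts the condition and side from a file name. The helper `parse_density_csv_name` is hypothetical and only mirrors the stated rules (first word before ‘_’, last word before .csv); it is not taken from the UNRAVEL source.

```python
from pathlib import Path


def parse_density_csv_name(path):
    """Return (condition, side) parsed from a density .csv file name.

    side is 'LH' or 'RH' for bilateral inputs and None for unilateral inputs.
    """
    stem = Path(path).stem          # file name without the .csv extension
    parts = stem.split("_")
    condition = parts[0]            # first word before '_'
    side = parts[-1] if parts[-1] in ("LH", "RH") else None
    return condition, side


files = [
    "condition1_sample01_cell_density_data_LH.csv",
    "condition1_sample01_cell_density_data_RH.csv",
    "condition2_sample03_cell_density_data.csv",
]

parsed = [parse_density_csv_name(f) for f in files]

# Pooling is attempted if any file carries a hemisphere suffix
has_hemisphere = any(side is not None for _, side in parsed)
print(parsed, has_hemisphere)
```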
Examples
- Grouping data by condition prefixes:
cstats --groups psilocybin saline --condition_prefixes saline psilocybin
- This will treat all ‘psilocybin*’ conditions as one group and all ‘saline*’ conditions as another.
- Since there will then effectively be two conditions in this case, they will be compared using a t-test.
- Columns in the .csv files:
sample, cluster_ID, <cell_count|label_volume>, cluster_volume, <cell_density|label_density>, …
Usage for t-tests:
cstats --groups <group1> <group2> -hg <group1|group2> [-cp <condition_prefixes>] [-alt <two-sided|less|greater>] [-pvt <p_value_threshold.txt>] [-v]
Usage for Tukey’s tests:
cstats --groups <group1> <group2> <group3> <group4> … -hg <group1|group2> [-cp <condition_prefixes>] [-alt <two-sided|less|greater>] [-pvt <p_value_threshold.txt>] [-v]
- unravel.cluster_stats.cstats.condition_selector(df, condition, unique_conditions, condition_column='Conditions')
Create a condition selector to handle pooling of data in a DataFrame based on specified conditions. This function checks if the ‘condition’ is exactly present in the ‘Conditions’ column or is a prefix of any condition in this column. If the exact condition is found, it selects those rows. If the condition is a prefix (e.g., ‘saline’ matches ‘saline-1’, ‘saline-2’), it selects all rows where the ‘Conditions’ column starts with this prefix. An error is raised if the condition is neither found as an exact match nor as a prefix.
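- Parameters:
df – the DataFrame containing the data
condition – the condition to select, either an exact condition name or a prefix
unique_conditions – the unique conditions present in the condition column
condition_column – the name of the column holding the conditions (default: ‘Conditions’)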
- Returns:
A boolean Series to select rows based on the condition.
- Return type:
pd.Series
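The exact-match-then-prefix logic described for this function can be sketched as below. This is a simplified stand-in rather than the UNRAVEL code: the name `select_condition` is hypothetical, and it omits the `unique_conditions` argument by reading the values directly from the column.

```python
import pandas as pd


def select_condition(df, condition, condition_column="Conditions"):
    """Return a boolean Series selecting rows by exact condition or by prefix."""
    values = df[condition_column].astype(str)

    # Prefer an exact match when the condition is present as-is
    exact = values == condition
    if exact.any():
        return exact

    # Otherwise treat the condition as a prefix (e.g., 'saline' matches 'saline-1')
    prefix = values.str.startswith(condition)
    if prefix.any():
        return prefix

    raise ValueError(f"Condition {condition!r} not found as an exact match or prefix.")


df = pd.DataFrame({"Conditions": ["saline-1", "saline-2", "psilocybin"]})
print(df[select_condition(df, "saline")])      # pools saline-1 and saline-2
print(df[select_condition(df, "psilocybin")])  # exact match only
```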
- unravel.cluster_stats.cstats.cluster_validation_data_df(density_col, has_hemisphere, csv_files, groups, data_col, data_col_pooled, condition_prefixes=None)
Aggregate the data from all .csv files, pool bilateral data if hemispheres are present, optionally pool data by condition, and return the DataFrame.
- Parameters:
density_col – the column name for the density data
has_hemisphere – whether the data files contain hemisphere indicators (e.g., _LH.csv or _RH.csv)
csv_files – a list of .csv files
groups – a list of group names
data_col – the column name for the data (cell_count or label_volume)
data_col_pooled – the column name for the pooled data
- Returns:
the DataFrame containing the cluster data
Columns: ‘condition’, ‘sample’, ‘cluster_ID’, ‘cell_count’, ‘cluster_volume’, ‘cell_density’
- Return type:
data_df (pd.DataFrame)
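One plausible way to pool bilateral data is sketched below. It assumes that pooling means summing the counts and cluster volumes across hemispheres and then recomputing the density; the source does not spell out the exact procedure, so treat this as an illustration only. The sample values are made up.

```python
import pandas as pd

# Hypothetical left- and right-hemisphere data for one sample
lh = pd.DataFrame({
    "sample": ["sample01", "sample01"],
    "cluster_ID": [1, 2],
    "cell_count": [30, 10],
    "cluster_volume": [2.0, 1.0],
})
rh = pd.DataFrame({
    "sample": ["sample01", "sample01"],
    "cluster_ID": [1, 2],
    "cell_count": [50, 20],
    "cluster_volume": [2.0, 4.0],
})

# Tag each table with its condition and stack them
combined = pd.concat(
    [lh.assign(condition="condition1"), rh.assign(condition="condition1")],
    ignore_index=True,
)

# Assumed pooling rule: sum counts and volumes per sample and cluster
pooled = combined.groupby(
    ["condition", "sample", "cluster_ID"], as_index=False
)[["cell_count", "cluster_volume"]].sum()

# Recompute density from the pooled totals instead of averaging densities
pooled["cell_density"] = pooled["cell_count"] / pooled["cluster_volume"]
print(pooled)
```

Recomputing the density from pooled totals weights each hemisphere by its cluster volume, which a plain average of the two densities would not do.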
- unravel.cluster_stats.cstats.valid_clusters_t_test(df, group1, group2, density_col, alternative='two-sided')
Perform unpaired t-tests for each cluster in the DataFrame and return the results as a DataFrame.
- Parameters:
df – the DataFrame containing the cluster data (columns: ‘condition’, ‘sample’, ‘cluster_ID’, ‘cell_count’, ‘cluster_volume’, ‘cell_density’)
group1 – the name of the first group
group2 – the name of the second group
density_col – the column name for the density data
alternative – the alternative hypothesis (‘two-sided’, ‘less’, or ‘greater’)
- Returns:
the DataFrame containing the t-test results
Columns: ‘cluster_ID’, ‘comparison’, ‘higher_mean_group’, ‘p-value’, ‘significance’
- Return type:
stats_df (pd.DataFrame)
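A per-cluster unpaired t-test along these lines can be sketched with `scipy.stats.ttest_ind`. The function name `t_test_per_cluster` and the data are hypothetical, and the ‘significance’ column is left out because its format is not described here; this is not the UNRAVEL implementation.

```python
import pandas as pd
from scipy.stats import ttest_ind


def t_test_per_cluster(df, group1, group2, density_col, alternative="two-sided"):
    """Run an unpaired t-test for each cluster and collect the results."""
    rows = []
    for cluster_id, cluster_df in df.groupby("cluster_ID"):
        a = cluster_df.loc[cluster_df["condition"] == group1, density_col]
        b = cluster_df.loc[cluster_df["condition"] == group2, density_col]
        result = ttest_ind(a, b, alternative=alternative)
        rows.append({
            "cluster_ID": cluster_id,
            "comparison": f"{group1} vs {group2}",
            "higher_mean_group": group1 if a.mean() > b.mean() else group2,
            "p-value": float(result.pvalue),
        })
    return pd.DataFrame(rows)


# Made-up densities: cluster 1 differs strongly between groups, cluster 2 does not
df = pd.DataFrame({
    "condition": ["psilocybin"] * 4 + ["saline"] * 4 + ["psilocybin"] * 4 + ["saline"] * 4,
    "cluster_ID": [1] * 8 + [2] * 8,
    "cell_density": [10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 5.5, 6.5, 7.5, 6.0],
})

stats_df = t_test_per_cluster(df, "psilocybin", "saline", "cell_density")
print(stats_df)
```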
- unravel.cluster_stats.cstats.perform_tukey_test(df, groups, density_col)
Perform Tukey’s HSD test for each cluster in the DataFrame and return the results as a DataFrame.
- Parameters:
df – the DataFrame containing the cluster data (columns: ‘condition’, ‘sample’, ‘cluster_ID’, ‘cell_count’, ‘cluster_volume’, ‘cell_density’)
groups – a list of group names
density_col – the column name for the density data
- Returns:
the DataFrame containing the Tukey’s HSD test results
Columns: ‘cluster_ID’, ‘comparison’, ‘higher_mean_group’, ‘p-value’, ‘significance’
- Return type:
stats_df (pd.DataFrame)
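For more than two groups, a per-cluster Tukey’s HSD test could look like the sketch below. It uses `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later) as a stand-in; the library that UNRAVEL actually calls is not stated here, and the function name `tukey_per_cluster` and the data are hypothetical.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import tukey_hsd


def tukey_per_cluster(df, groups, density_col):
    """Run Tukey's HSD for each cluster and list every pairwise comparison."""
    rows = []
    for cluster_id, cluster_df in df.groupby("cluster_ID"):
        # One array of densities per group, in the order given by `groups`
        samples = [
            cluster_df.loc[cluster_df["condition"] == g, density_col].to_numpy()
            for g in groups
        ]
        result = tukey_hsd(*samples)  # result.pvalue is a (k, k) matrix
        for i, j in combinations(range(len(groups)), 2):
            higher = groups[i] if samples[i].mean() > samples[j].mean() else groups[j]
            rows.append({
                "cluster_ID": cluster_id,
                "comparison": f"{groups[i]} vs {groups[j]}",
                "higher_mean_group": higher,
                "p-value": float(result.pvalue[i, j]),
            })
    return pd.DataFrame(rows)


# Made-up densities for one cluster: group C is clearly higher than A and B
df = pd.DataFrame({
    "condition": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "cluster_ID": [1] * 12,
    "cell_density": [1, 2, 3, 4, 1.5, 2.5, 3.5, 4.5, 10, 11, 12, 13],
})

tukey_df = tukey_per_cluster(df, ["A", "B", "C"], "cell_density")
print(tukey_df)
```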