so4gp package¶
- class DataGP(data_source, min_sup=0.5, eq=False)[source]¶
Bases: object
- Parameters:
data_source (pd.DataFrame | str) – [required] a data source, it can either be a ‘file in csv format’ or a ‘Pandas DataFrame’
min_sup (float) – [optional] minimum support threshold, the default is 0.5
eq (bool) – [optional] encode equal values as gradual, the default is False
- add_gradual_pattern(pattern)[source]¶
Adds a gradual pattern to the list of gradual patterns.
- Parameters:
pattern – A gradual pattern
- Return type:
None
- classmethod analyze_gps(data_src, min_sup, est_gps, approach='bfs')[source]¶
For each estimated GP, computes its true support using the GRAANK approach and returns the statistics (% error, and standard deviation).
>>> import so4gp as sgp
>>> import pandas
>>> dummy_data = [[30, 3, 1, 10], [35, 2, 2, 8], [40, 4, 2, 7], [50, 1, 1, 6], [52, 7, 1, 2]]
>>> columns = ['Age', 'Salary', 'Cars', 'Expenses']
>>> dummy_df = pandas.DataFrame(dummy_data, columns=columns)
>>>
>>> estimated_gps = list()
>>> temp_gp = sgp.GP()
>>> for gi_str in ['0+', '1-']:
...     temp_gp.add_gradual_item(sgp.GI.from_string(gi_str))
>>> temp_gp.support = 0.5
>>> estimated_gps.append(temp_gp)
>>> temp_gp = sgp.GP()
>>> for gi_str in ['1+', '3-', '0+']:
...     temp_gp.add_gradual_item(sgp.GI.from_string(gi_str))
>>> temp_gp.support = 0.48
>>> estimated_gps.append(temp_gp)
>>> res = sgp.analyze_gps(dummy_df, min_sup=0.4, est_gps=estimated_gps, approach='bfs')
>>> print(res)
Gradual Pattern       Estimated Support    True Support    Percentage Error    Standard Deviation
['0+', '1-']          0.5                  0.4             25.0%               0.071
['1+', '3-', '0+']    0.48                 0.6             -20.0%              0.085
- Parameters:
data_src (DataFrame | str) – Data set, either a Pandas DataFrame or a file in CSV format
min_sup (float) – Minimum support (set by user)
est_gps (list[GP]) – Estimated GPs
approach (str) – ‘bfs’ (default) or ‘dfs’
- Returns:
Tabulated results
- Return type:
str
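The figures in the example output above are consistent with a signed relative error and a sample standard deviation taken over each (estimated, true) support pair. The following is a minimal sketch under that assumption; `percentage_error` and `pair_std` are hypothetical helper names, not part of so4gp, and the library may compute its statistics differently.

```python
import statistics


def percentage_error(estimated: float, true: float) -> float:
    """Signed error of the estimate relative to the true support, in percent."""
    return (estimated - true) / true * 100.0


def pair_std(estimated: float, true: float) -> float:
    """Sample standard deviation of the two support values (assumed definition)."""
    return statistics.stdev([estimated, true])


# (estimated support, true support) pairs taken from the example output above
for est, true in [(0.5, 0.4), (0.48, 0.6)]:
    print(f"{percentage_error(est, true):.1f}%  {pair_std(est, true):.3f}")
```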
- property attr_cols: ndarray¶
- property attr_size: int¶
- static clean_data(df)[source]¶
Cleans a data-frame (i.e., missing values, outliers) before extraction of GPs
- Parameters:
df (pd.DataFrame) – data-frame
- Returns:
list (column titles), numpy (cleaned data)
- Return type:
tuple[list, ndarray]
- property col_count: int¶
- property data: ndarray¶
- property display_patterns: list¶
- property display_patterns_as_df: DataFrame¶
- fit_bitmap(attr_data=None)[source]¶
Generates bitmaps for columns with numeric objects. It stores the bitmaps in the attribute valid_bins (only those bitmaps whose computed support values are greater than or equal to the minimum support threshold).
- Parameters:
attr_data (np.ndarray | None) – Stepped attribute objects
- Returns:
void
- Return type:
None
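As an illustration of the idea behind fit_bitmap, the sketch below builds one pairwise bitmap per column and variation symbol in plain Python and keeps only those that meet the threshold. It assumes a GRAANK-style definition in which support is the number of set object pairs divided by n(n-1)/2; `pairwise_bitmap` and `support` are hypothetical helpers, and the local dictionary only mirrors the role of valid_bins. The library's internal representation may differ.

```python
def pairwise_bitmap(values, symbol="+"):
    """bitmap[i][j] is 1 when going from object i to object j respects the symbol."""
    n = len(values)
    if symbol == "+":
        return [[int(values[i] < values[j]) for j in range(n)] for i in range(n)]
    return [[int(values[i] > values[j]) for j in range(n)] for i in range(n)]


def support(bitmap):
    """Share of the n(n-1)/2 object pairs that are set in the bitmap."""
    n = len(bitmap)
    return sum(sum(row) for row in bitmap) / (n * (n - 1) / 2)


min_sup = 0.5
columns = {"Age": [30, 35, 40, 50, 52], "Cars": [1, 2, 2, 1, 1]}

valid_bins = {}  # local stand-in, keyed by e.g. "Age+"
for name, values in columns.items():
    for symbol in "+-":
        bitmap = pairwise_bitmap(values, symbol)
        if support(bitmap) >= min_sup:
            valid_bins[name + symbol] = bitmap

print({key: support(bitmap) for key, bitmap in valid_bins.items()})
```

Tied values never set a bit, which is why the Cars column scores lower than the tie-free Age column in this sketch.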
- fit_warpingset()[source]¶
Generates transaction ids (tids) for each column/feature with numeric objects. It stores the tids in the attribute valid_tids (only those tids whose computed support values are greater than or equal to the minimum support threshold).
The method decomposes the pairwise matrix of a gradual item/pattern into a warping set. Attributes that have strong correlation will produce a warping set with dense zigzag patterns when plotted as a graph. Those with weak correlation will produce a warping set with sparse zigzag patterns.
- Return type:
None
- static gen_gradual_warping_set(pairwise_mat, as_array=False)[source]¶
A method that decomposes the pairwise matrix of a gradual item/pattern into a warping set. Attributes that have strong correlation will produce a warping set with dense zigzag patterns when plotted as a graph. Those with weak correlation will produce a warping set with sparse zigzag patterns.
- Parameters:
pairwise_mat (ndarray) – The pairwise matrix of a gradual item/pattern.
as_array (bool) – If True, returns the warping path as a NumPy array; otherwise, as a list of tuples.
- Returns:
The warping path as an edge list (a list of tuples, or a NumPy array if as_array is True).
- Return type:
list[tuple[int, int]] | ndarray
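One plausible reading of this decomposition is that every set entry (i, j) of the pairwise matrix becomes one edge of the warping set. The sketch below shows that reading only; `warping_set_from_matrix` is a hypothetical name, and the actual method may order, filter, or encode the pairs differently.

```python
def warping_set_from_matrix(pairwise_mat):
    """Collect the (row, column) index of every non-zero entry as an edge list."""
    return [
        (i, j)
        for i, row in enumerate(pairwise_mat)
        for j, bit in enumerate(row)
        if bit
    ]


# A small hand-written 3x3 pairwise matrix used purely for illustration
matrix = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
print(warping_set_from_matrix(matrix))
```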
- generate_output_files(alg_data, target_col=None, save_to_file=True)[source]¶
Generates output of results (as files) for the GP mining algorithm.
- Parameters:
alg_data (dict) – Dictionary of algorithm parameters.
target_col (int) – Index of the target column.
save_to_file (bool) – If True, saves the output to files.
- property gradual_patterns: list | None¶
- static read(data_src)[source]¶
Reads all the contents of a file (in CSV format) or a data-frame. Checks if its columns have numeric values. It separates its column headers (titles) from the objects.
- Parameters:
data_src (pd.DataFrame | str) – A data source, it can either be a ‘file in csv format’ or a ‘Pandas DataFrame’
- Returns:
The title, column objects
- Return type:
tuple[list, ndarray]
- remove_subsets(gi_arr, gradual_patterns=None)[source]¶
Remove subset GPs from the list.
- Parameters:
gi_arr (set) – Gradual items in an array
gradual_patterns (list[GP] | None) – List of gradual patterns (if None, use the object’s GPs)
- Returns:
List of GPs
- Return type:
None
- property row_count: int¶
- static test_time(date_str)[source]¶
Tests if a str represents a date-time variable.
- Parameters:
date_str (str) – A string
- Returns:
bool (True if it is a date-time variable, False otherwise)
- Return type:
None | tuple[bool, float] | tuple[bool, bool]
- property thd_supp: float¶
- property time_cols: ndarray¶
- property titles: list¶
- property valid_bins: dict | None¶
- property warping_set: dict[str, list] | None¶
- class GI(attr_col, symbol)[source]¶
Bases: object
- Parameters:
attr_col (int) – Column index
symbol (str) – Variation symbol either “+” or “-”
- property as_tuple: tuple[int, str]¶
The Gradual Item (GI) in tuple format
- property attribute_col: int¶
The column index of a GI
- classmethod from_string(gi_str)[source]¶
Creates a GI from a string like ‘1+’, ‘12-’, or ‘125+’
- Parameters:
gi_str (str)
- Return type:
GI
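The string format accepted here (a column index followed by a variation symbol) can be parsed with a single regular expression. The sketch below is a stand-alone illustration that returns a plain (column, symbol) tuple, similar in shape to as_tuple; `parse_gradual_item` is a hypothetical helper and not the library's implementation.

```python
import re


def parse_gradual_item(gi_str):
    """Split a string such as '125+' into (column index, variation symbol)."""
    match = re.fullmatch(r"(\d+)([+-])", gi_str.strip())
    if match is None:
        raise ValueError(f"not a gradual item string: {gi_str!r}")
    return int(match.group(1)), match.group(2)


for text in ["1+", "12-", "125+"]:
    print(parse_gradual_item(text))
```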
- static parse_gi(gi_str)[source]¶
Converts a stringified GI into a normal GI. The accepted format is ‘1_neg’ or ‘1_pos’.
- Parameters:
gi_str (str) – A stringified GI
- Returns:
GI
- Return type:
GI
- static swap_gi_symbol(gi_obj)[source]¶
Inverts a GI symbol to the opposite variation (i.e., from - to +, or from + to -).
- Returns:
inverted GI object
- property symbol: str¶
The variation symbol of a GI
- class GP[source]¶
Bases: object
GP (Gradual Pattern). A class that is used to create GP objects. A GP object is a set of gradual items (GI), and its quality is measured by its computed support value. For example, given a data set with 3 columns (age, salary, cars) and 10 objects, a GP may take the form {age+, salary-} with a support of 0.8. This implies that 8 out of 10 objects have the values of column ‘age’ increasing and the values of column ‘salary’ decreasing.
>>> import so4gp as sgp
>>> gradual_pattern = sgp.GP()
>>> gradual_pattern.add_gradual_item(sgp.GI(0, "+"))
>>> gradual_pattern.add_gradual_item(sgp.GI(1, "-"))
>>> gradual_pattern.support = 0.5
>>> print(f"{gradual_pattern.to_string()}: {gradual_pattern.support}")
- add_gradual_item(item)[source]¶
Adds a gradual item (GI) into the gradual pattern (GP).
- Returns:
True if gradual item is added, None otherwise
- Parameters:
item (GI) – gradual item
- Return type:
bool
- property as_swapped_set: set[str]¶
Returns the gradual pattern (GP) as a set of strings, e.g. {‘1-’, ‘2+’}
- property avg_deviation_from_diagonal: float¶
- check_am(gp_list, subset=True)[source]¶
Anti-monotonicity check. Checks if a GP is a subset or superset of an already existing GP
- Parameters:
gp_list (list[GP] | None) – A list of existing GPs
subset (bool) – A check if it is a subset
- Returns:
True if superset/subset, False otherwise
- Return type:
bool
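The subset/superset test behind an anti-monotonicity check can be illustrated with ordinary Python sets. In the sketch below each pattern is represented as a set of gradual-item strings; this representation and the helper name `is_subset_or_superset` are assumptions made for illustration, not the library's data model.

```python
def is_subset_or_superset(candidate, existing, subset=True):
    """Return True if candidate is a subset (or superset) of any existing pattern."""
    for other in existing:
        if subset and candidate <= other:
            return True
        if not subset and candidate >= other:
            return True
    return False


existing = [{"0+", "1-", "3-"}, {"2+", "3+"}]
print(is_subset_or_superset({"0+", "1-"}, existing, subset=True))
print(is_subset_or_superset({"2+", "3+", "0-"}, existing, subset=False))
print(is_subset_or_superset({"0+", "2-"}, existing, subset=True))
```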
- compute_descriptors(warping_set, obj_count)[source]¶
Computes gradual warping set (GWS) descriptors for a given gradual pattern.
The descriptors are defined as follows:
- Density (ρ_g):
Proportion of concordant index pairs relative to all possible pairs: ρ_g = |W_g| / C(n, 2).
- Average Deviation from Diagonal (μ_g):
Mean absolute distance |i - j| across all pairs in W_g.
- Rank Dispersion (σ_g):
Standard deviation of |i - j|, capturing variability of index distances across all pairs in W_g.
- Graph Connectivity (κ_g):
Number of connected components when W_g is interpreted as an undirected graph.
- Singularity Score (S_g):
Measures concentration of index participation (node degree skewness). High values indicate dominance of certain indices.
- Path-like behavior (DTW-like) is approximated when:
κ_g = 1, S_g is low, and σ_g is smooth (low variance).
- Parameters:
warping_set (ndarray | None) – np.ndarray of shape (k, 2), containing index pairs (i, j)
obj_count (int) – Total number of objects (n)
- Returns:
True if descriptors are computed successfully, False otherwise
- Return type:
bool
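Four of the descriptors defined above can be computed directly from an edge list, as the sketch below shows. It rests on assumptions that the text leaves open: the rank dispersion uses the population standard deviation, and connected components are counted only among indices that appear in W_g. The singularity score is omitted because its exact formula is not given here. `describe_warping_set` is a hypothetical helper, not the library method.

```python
import math
import statistics


def describe_warping_set(warping_set, obj_count):
    """Density, mean |i - j|, std of |i - j| and component count for W_g."""
    pairs = [tuple(pair) for pair in warping_set]
    if not pairs or obj_count < 2:
        return None

    distances = [abs(i - j) for i, j in pairs]

    # Union-find over the indices that occur in the warping set
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for i, j in pairs:
        parent[find(i)] = find(j)

    return {
        "density": len(pairs) / math.comb(obj_count, 2),
        "avg_deviation_from_diagonal": statistics.fmean(distances),
        "rank_dispersion": statistics.pstdev(distances),  # assumed population std
        "graph_connectivity": len({find(node) for node in list(parent)}),
    }


# Index pairs (i, j) of a 5-object example, written by hand for illustration
result = describe_warping_set([(0, 1), (0, 3), (1, 3), (2, 3)], obj_count=5)
print(result)
```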
- contains_attr(gi)[source]¶
Checks if any gradual item (GI) in the gradual pattern (GP) is composed of the same column as the given GI.
- Returns:
True if a column exists, False otherwise
- Parameters:
gi (GI) – gradual item
- Return type:
bool
- decompose()[source]¶
Breaks down all the gradual items (GIs) in the gradual pattern into columns and variation symbols and returns them as separate variables. For instance, a GP {“1+”, “3-”} will be returned as [1, 3], [1, -1], where [1, 3] is the list of attributes/features and [1, -1] are their corresponding gradual variations (1 -> ‘+’ and -1 -> ‘-’).
- Returns:
Separate columns and variation symbols
- Return type:
tuple[list[int], list[str]]
- property density: float¶
- get_computed_descriptors(descriptor_title)[source]¶
Returns the computed descriptors of the gradual pattern (GP)
- Parameters:
descriptor_title – If True, returns a dictionary with column names as keys and descriptors as values
- Returns:
List of descriptors
- Return type:
list[str] | list[dict]
- property graph_connectivity: int¶
- is_duplicate(valid_gps, invalid_gps=None)[source]¶
Checks if a pattern is in the list of winner GPs or loser GPs
- static perform_and(bin_data_1, bin_data_2, dim)[source]¶
Perform logical AND operation on two bitmaps.
- Parameters:
bin_data_1 (PairwiseMatrix | None) – Bitmap 1
bin_data_2 (PairwiseMatrix | None) – bitmap 2
dim (int) – dimension of the bitmaps
- Return type:
- print(columns, descriptor_title=False)[source]¶
A method that returns patterns with actual column names
- Parameters:
columns (list[str]) – Column names
descriptor_title (bool) – If True, returns a dictionary with column names as keys and descriptors as values
- Returns:
GP with actual column names
- Return type:
tuple[str, list[str] | list[dict]]
- property rank_dispersion: float¶
- property singularity_score: float¶
- property support: float¶
- static swap_gp_symbols(gp_obj)[source]¶
Swaps the variation symbols of all the gradual items (GIs) in a gradual pattern (GP)
- validate_graank(d_gp)[source]¶
Validates a candidate gradual pattern (GP) based on support computation. A GP is invalid if its support value is less than the minimum support threshold set by the user. It uses a breadth-first approach to compute support.
- Parameters:
d_gp (so4gp.DataGP) – DataGP object
- Returns:
A valid GP or an empty GP
- Return type:
GP
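The validation step can be pictured as AND-ing the pairwise bitmaps of a pattern's gradual items and comparing the resulting support with the threshold. The sketch below does this in plain Python for the pattern ['0+', '1-'] on the Age and Salary columns of the analyze_gps example on this page, assuming support is the number of set pairs divided by n(n-1)/2. The helpers are hypothetical and stand in for the library's bitmap handling.

```python
def pairwise_bitmap(values, symbol):
    """bitmap[i][j] is 1 when going from object i to object j respects the symbol."""
    n = len(values)
    if symbol == "+":
        return [[int(values[i] < values[j]) for j in range(n)] for i in range(n)]
    return [[int(values[i] > values[j]) for j in range(n)] for i in range(n)]


def bitmap_and(left, right):
    """Element-wise logical AND of two equally sized bitmaps."""
    return [[a & b for a, b in zip(row_l, row_r)] for row_l, row_r in zip(left, right)]


def support(bitmap):
    """Share of the n(n-1)/2 object pairs that are set in the bitmap."""
    n = len(bitmap)
    return sum(sum(row) for row in bitmap) / (n * (n - 1) / 2)


age = [30, 35, 40, 50, 52]    # column 0
salary = [3, 2, 4, 1, 7]      # column 1
min_sup = 0.4

joint = bitmap_and(pairwise_bitmap(age, "+"), pairwise_bitmap(salary, "-"))
supp = support(joint)
print(supp, supp >= min_sup)
```

Under these assumptions the computed support matches the True Support reported for ['0+', '1-'] in that example.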
- validate_tree(d_gp)[source]¶
Validates a candidate gradual pattern (GP) based on support computation. A GP is invalid if its support value is less than the minimum support threshold set by the user. It applies a depth-first (FP-Growth) approach to compute support.
- Parameters:
d_gp (so4gp.DataGP) – DataGP object
- Returns:
A valid GP or an empty GP
- class PairwiseMatrix(bin_mat: numpy.ndarray, support: float)[source]¶
Bases: object
- Parameters:
bin_mat (ndarray)
support (float)
- bin_mat: ndarray¶
- support: float¶
- class TGP[source]¶
Bases: GP
A class that inherits from the GP class to create Temporal GP objects. A TGP is a gradual pattern with a time-delay. It has a target gradual item (created from a user-defined attribute), which is used as the anchor for mining patterns from a dataset. The class has the following attributes:
target_gradual_item: the gradual item on which the pattern is based.
temporal_gradual_items: gradual items which occur after specific time delays.
>>> import so4gp as sgp
>>> t_gp = sgp.TGP()
>>> t_gp.target_gradual_item = sgp.GI(1, "+")
>>> t_gp.add_temporal_gradual_item(sgp.GI(2, "-"), sgp.TimeDelay(7200, 0.8))
>>> t_gp.to_string()
- class TemporalGI(gradual_item: so4gp.gradual_patterns.GI, time_delay: so4gp.gradual_patterns.TimeDelay)[source]¶
Bases:
object
- add_temporal_gradual_item(item, time_delay)[source]¶
Adds a fuzzy temporal gradual item (fTGI) into the fuzzy temporal gradual pattern (fTGP).
- Parameters:
item (so4gp.GI) – gradual item
- is_similar_to(ftgp)[source]¶
Checks if two fuzzy temporal gradual patterns are similar.
- Parameters:
ftgp – Fuzzy temporal gradual pattern to compare with.
- Returns:
True if the patterns are similar, False otherwise.
- Return type:
bool
- print(columns, descriptor_title=False)[source]¶
A method that returns a fuzzy temporal gradual pattern (TGP) with actual column names
- Parameters:
columns (list[str]) – Column names
descriptor_title (bool) – If True, prints the descriptor title
- Returns:
TGP with actual column names
- Return type:
tuple[str, list[str] | list[dict]]
- property temporal_gradual_items: list[TemporalGI]¶
- class TimeDelay(tstamp=0, supp=0)[source]¶
Bases: object
- Parameters:
tstamp (Float) – The time-delay value as a timestamp.
supp (Float) – The true value of the time-delay value.
- property formatted_time: dict¶
- property sign: str¶
- property support: float¶
- property timestamp: float¶
- to_string()[source]¶
Returns the formatted time-delay as a string.
- Returns:
The time-delay as a string.
- Return type:
str
- property valid: bool¶
- analyze_gps(data_src, min_sup, est_gps, approach='bfs')¶
For each estimated GP, computes its true support using the GRAANK approach and returns the statistics (% error, and standard deviation).
>>> import so4gp as sgp
>>> import pandas
>>> dummy_data = [[30, 3, 1, 10], [35, 2, 2, 8], [40, 4, 2, 7], [50, 1, 1, 6], [52, 7, 1, 2]]
>>> columns = ['Age', 'Salary', 'Cars', 'Expenses']
>>> dummy_df = pandas.DataFrame(dummy_data, columns=columns)
>>>
>>> estimated_gps = list()
>>> temp_gp = sgp.GP()
>>> for gi_str in ['0+', '1-']:
...     temp_gp.add_gradual_item(sgp.GI.from_string(gi_str))
>>> temp_gp.support = 0.5
>>> estimated_gps.append(temp_gp)
>>> temp_gp = sgp.GP()
>>> for gi_str in ['1+', '3-', '0+']:
...     temp_gp.add_gradual_item(sgp.GI.from_string(gi_str))
>>> temp_gp.support = 0.48
>>> estimated_gps.append(temp_gp)
>>> res = sgp.analyze_gps(dummy_df, min_sup=0.4, est_gps=estimated_gps, approach='bfs')
>>> print(res)
Gradual Pattern       Estimated Support    True Support    Percentage Error    Standard Deviation
['0+', '1-']          0.5                  0.4             25.0%               0.071
['1+', '3-', '0+']    0.48                 0.6             -20.0%              0.085
- Parameters:
data_src (DataFrame | str) – Data set, either a Pandas DataFrame or a file in CSV format
min_sup (float) – Minimum support (set by user)
est_gps (list[GP]) – Estimated GPs
approach (str) – ‘bfs’ (default) or ‘dfs’
- Returns:
Tabulated results
- Return type:
str