Algorithms¶
To run each algorithm to extract GPs, follow the instructions that follow.
1. GRAANK¶
GRAdual rANKing Algorithm for GPs (GRAANK) is the classical approach (initially proposed by Anne Laurent) for mining gradual patterns. All the remaining algorithms are variants of this algorithm.
import pandas as pd
import so4gp as sgp
dummy_data = [["2021-03", 30, 3, 1, 10], ["2021-04", 35, 2, 2, 8], ["2021-05", 40, 4, 2, 7], ["2021-06", 50, 1, 1, 6], ["2021-07", 52, 7, 1, 2]]
dummy_df = pd.DataFrame(dummy_data, columns=['Date', 'Age', 'Salary', 'Cars', 'Expenses'])
mine_obj = sgp.algorithms.GRAANK(data_source=dummy_data, min_sup=0.5, eq=False)
gp_json = mine_obj.discover()
print(gp_json)
where you specify the parameters as follows:
data_source - [required] data source {either a
file in csv formator aPandas DataFrame}min_sup - [optional] minimum support
default = 0.5eq - [optional] encode equal values as gradual
default = False
2. ACO-GRAANK¶
ACO-GRAANK stands for Ant Colony Optimization for GPs. In this approach, it is assumed that every column can be converted into a gradual item (GI). If the GI is valid (i.e., its computed support is greater than the minimum support threshold), then it is either increasing or decreasing (+ or -), otherwise it is irrelevant (x). Therefore, a pheromone matrix is built using the number of columns and the possible variations (increasing, decreasing, irrelevant) or (+, -, x). The algorithm starts by randomly generating GP candidates using the pheromone matrix, each candidate is validated by confirming that its computed support is greater or equal to the minimum support threshold. The valid GPs are used to update the pheromone levels and better candidates are generated.
import pandas as pd
import so4gp as sgp
dummy_data = [["2021-03", 30, 3, 1, 10], ["2021-04", 35, 2, 2, 8], ["2021-05", 40, 4, 2, 7], ["2021-06", 50, 1, 1, 6], ["2021-07", 52, 7, 1, 2]]
dummy_df = pd.DataFrame(dummy_data, columns=['Date', 'Age', 'Salary', 'Cars', 'Expenses'])
mine_obj = sgp.algorithms.AntGRAANK(data_source=dummy_df, min_sup=0.5, max_iter=3, e_factor=0.5)
gp_json = mine_obj.discover()
print(gp_json)
where you specify the parameters as follows:
data_src - [required] data source {either a
file in csv formator aPandas DataFrame}min_sup - [optional] minimum support
default = 0.5max_iteration – [optional] maximum number of iterations
default = 1evaporation_factor - [optional] evaporation factor
default = 0.5
3. GA-GRAANK¶
GA-GRAANK stands for Genetic Algorithm for GPs. In this approach, it is assumed that every GP candidate may be represented as a binary gene (or individual) that has a unique position and cost. The cost is derived from the computed support of that candidate, the higher the support value, the lower the cost. The aim of the algorithm is to search through a population of individuals (or candidates) and find those with the lowest cost as efficiently as possible.
import pandas as pd
import so4gp as sgp
dummy_data = [["2021-03", 30, 3, 1, 10], ["2021-04", 35, 2, 2, 8], ["2021-05", 40, 4, 2, 7], ["2021-06", 50, 1, 1, 6], ["2021-07", 52, 7, 1, 2]]
dummy_df = pd.DataFrame(dummy_data, columns=['Date', 'Age', 'Salary', 'Cars', 'Expenses'])
mine_obj = sgp.algorithms.GeneticGRAANK(data_source=dummy_df, min_sup=0.5, max_iter=5, n_pop=5)
gp_json = mine_obj.discover()
print(gp_json)
where you specify the parameters as follows:
data_src - [required] data source {either a
file in csv formator aPandas DataFrame}min_sup - [optional] minimum support
default = 0.5max_iteration – [optional] maximum number of algorithm iterations
default = 1n_pop - [optional] initial population
default = 5pc - [optional] offspring population multiple
default = 0.5gamma - [optional] crossover rate
default = 1mu - [optional] mutation rate
default = 0.9sigma - [optional] mutation rate
default = 0.9
4. PSO-GRAANK¶
PSO-GRAANK stands for Particle Swarm Optimization for GPs. In this approach, it is assumed that every GP candidate may be represented as a particle that has a unique position and fitness. The fitness is derived from the computed support of that candidate, the higher the support value, the higher the fitness. The aim of the algorithm is to search through a population of particles (or candidates) and find those with the highest fitness as efficiently as possible.
import pandas as pd
import so4gp as sgp
dummy_data = [["2021-03", 30, 3, 1, 10], ["2021-04", 35, 2, 2, 8], ["2021-05", 40, 4, 2, 7], ["2021-06", 50, 1, 1, 6], ["2021-07", 52, 7, 1, 2]]
dummy_df = pd.DataFrame(dummy_data, columns=['Date', 'Age', 'Salary', 'Cars', 'Expenses'])
mine_obj = sgp.algorithms.ParticleGRAANK(data_source=dummy_df, min_sup=0.5, max_iter=5, n_particles=5)
gp_json = mine_obj.discover()
print(gp_json)
where you specify the parameters as follows:
data_src - [required] data source {either a
file in csv formator aPandas DataFrame}min_sup - [optional] minimum support
default = 0.5max_iteration – [optional] maximum number of algorithm iterations
default = 1n_particles - [optional] initial particle population
default = 5velocity - [optional] particle velocity
default = 0.9coeff_p - [optional] personal coefficient rate
default = 0.01coeff_g - [optional] global coefficient
default = 0.9
5. LS-GRAANK¶
LS-GRAANK stands for Local Search for GPs. In this approach, it is assumed that every GP candidate may be represented as a position that has a cost value associated with it. The cost is derived from the computed support of that candidate, the higher the support value, the lower the cost. The aim of the algorithm is to search through a group of positions and find those with the lowest cost as efficiently as possible.
import pandas as pd
import so4gp as sgp
dummy_data = [["2021-03", 30, 3, 1, 10], ["2021-04", 35, 2, 2, 8], ["2021-05", 40, 4, 2, 7], ["2021-06", 50, 1, 1, 6], ["2021-07", 52, 7, 1, 2]]
dummy_df = pd.DataFrame(dummy_data, columns=['Date', 'Age', 'Salary', 'Cars', 'Expenses'])
mine_obj = sgp.algorithms.HillClimbingGRAANK(data_source=dummy_df, min_sup=0.5, max_iter=3, step_size=0.5)
gp_json = mine_obj.discover()
print(gp_json)
where you specify the parameters as follows:
data_src - [required] data source {either a
file in csv formator aPandas DataFrame}min_sup - [optional] minimum support
default = 0.5max_iteration – [optional] maximum number of algorithm iterations
default = 1step_size - [optional] step size
default = 0.5
6. RS-GRAANK¶
RS-GRAANK stands for Random Search for GPs. In this approach, it is assumed that every GP candidate may be represented as a position that has a cost value associated with it. The cost is derived from the computed support of that candidate, the higher the support value, the lower the cost. The aim of the algorithm is to search through a group of positions and find those with the lowest cost as efficiently as possible.
import pandas as pd
import so4gp as sgp
dummy_data = [["2021-03", 30, 3, 1, 10], ["2021-04", 35, 2, 2, 8], ["2021-05", 40, 4, 2, 7], ["2021-06", 50, 1, 1, 6], ["2021-07", 52, 7, 1, 2]]
dummy_df = pd.DataFrame(dummy_data, columns=['Date', 'Age', 'Salary', 'Cars', 'Expenses'])
mine_obj = sgp.algorithms.RandomGRAANK(data_source=dummy_df, min_sup=0.5, max_iter=3)
gp_json = mine_obj.discover()
print(gp_json)
where you specify the parameters as follows:
data_src - [required] data source {either a
file in csv formator aPandas DataFrame}min_sup - [optional] minimum support
default = 0.5max_iteration – [optional] maximum number of algorithm iterations
default = 1
7. ClusterGP¶
Clustering algorithm for GPs (Clu-GRAD). We lend the net-win concept from the work ‘Clustering Using Pairwise Comparisons’ proposed by R. Srikant to the problem of extracting gradual patterns (GPs). To mine for GPs, each feature yields two gradual items which we use to construct a bitmap matrix comparing each row to each other (i.e., (r1,r2), (r1,r3), (r1,r4), (r2,r3), (r2,r4), (r3,r4)).
In this approach, we convert the bitmap matrices into ‘net-win vectors.’ Finally, we apply spectral clustering to determine which gradual items belong to the same group based on the similarity of net-win vectors. Gradual items in the same cluster should have almost the same score vector.
import pandas as pd
import so4gp as sgp
dummy_data = [["2021-03", 30, 3, 1, 10], ["2021-04", 35, 2, 2, 8], ["2021-05", 40, 4, 2, 7], ["2021-06", 50, 1, 1, 6], ["2021-07", 52, 7, 1, 2]]
dummy_df = pd.DataFrame(dummy_data, columns=['Date', 'Age', 'Salary', 'Cars', 'Expenses'])
mine_obj = sgp.algorithms.ClusterGP(data_source=dummy_df, min_sup=0.5, max_iter=3, e_prob=0.1)
gp_json = mine_obj.discover()
print(gp_json)
where you specify the parameters as follows:
data_src - [required] data source {either a
file in csv formator aPandas DataFrame}min_sup - [optional] minimum support
default = 0.5e_probability - [optional] erasure probability
default = 0.5max_iteration – [optional] maximum iterations for estimating score vectors
default = 10
Sample Output¶
The default output is the format of JSON:
{
"Algorithm": "RS-GRAANK",
"Best Patterns": [
[["Age+", "Salary+"], 0.6],
[["Expenses-", "Age+", "Salary+"], 0.6]
],
"Iterations": 20
}