Next Task Size Prediction Method for FP-Growth Algorithm
  • Jeong-Hoon Lee*

Human-centric Computing and Information Sciences volume 11, Article number: 13 (2021)
https://doi.org/10.22967/HCIS.2021.11.013

Abstract

Frequent pattern (FP) mining is useful for deriving meaningful information from data. Among related methods, FP-growth is one of the most widely used algorithms but faces time and physical restrictions when applied to big data mining. FP-growth repeatedly generates conditional FP-trees (CFP-trees) in memory and mines frequent patterns. As a result, memory shortage may occur when processing big data with FP-growth. If memory overflow occurs during FP-growth, mining can no longer proceed afterward. If memory overflow is predicted in advance, however, the error can be handled using methods such as switching to a secondary memory-based mining method or waiting until the memory is released and subsequently proceeding with the mining. In this paper, a next task size prediction method is proposed to predict whether the main memory is sufficient to perform the next task during the FP-growth process. Furthermore, it can also help in the parallel and distributed processing of FP-growth: the next task size prediction method can help solve the load balancing problem by allocating to each workstation a task sized to its remaining memory, or by assigning appropriately sized tasks to the workstations.


Keywords

FP-Growth, PPFP, Load Balancing, Parallel FP-Growth, PNC, Big Data


Introduction

Since 1993, when the concept of the association rule [1] was introduced, the task of analyzing frequent patterns (FP) in data mining has been researched actively. The FP-growth algorithm suggested by Han et al. [2] is one of the most popular FP algorithms. Unlike the existing Apriori-based methods [3], the FP-growth algorithm compresses FPs into an FP-tree. Concerning pattern extraction, this algorithm is widely used in various fields such as biology [4], commerce [5, 6], education [7], recommendation [8], healthcare [9-11], sports [12], earth science [13], and the Internet of Things [14, 15].
According to the additional research conducted by Pei et al. [16], however, FP-growth still has several problems in processing big data. The FP-tree is used to compress the transaction data but is unable to do so to a maximum extent. Therefore, these data take up a large memory space. In addition, it generates a considerable volume of conditional FP-tree (CFP-tree) data and consumes considerable time and space.
Several methods were proposed to overcome these problems. In [17, 18], the authors adopted a method of pruning unnecessary areas in a tree using minimum support and confidence in traditional FP mining to reduce the mining area. In [19-22], the FP-tree size was reduced by analyzing only the patterns corresponding to the most frequent or important top-K items. Memory usage and time consumption during pattern analysis were thereby reduced because unnecessary patterns were excluded from the analysis scope. However, since the minimum support threshold and the confidence used to remove irrelevant patterns were defined manually based on an analyst's subjective criteria, objectivity was a problem. In addition, because these methods could not automatically extract meaningful frequent items or eliminate irrelevant ones, considerable time and space were still consumed. Under these circumstances, mining could eventually fail.
In practice, having only the physical memory is not sufficient to perform mining on very large datasets by using these algorithms, which depend on the size of the primary memory. Goethals [23], Vaarandi [24], and Buehrer et al. [25] argued that the standard physical memory available in modern CPUs was insufficient to mine association rules from real-world datasets.
Aside from the main FP-tree, the FP-growth process also generates CFP-trees repeatedly in the main memory. When mining frequent itemsets from big data with FP-growth, memory overflow may occur. If the size of the next task could be predicted during the FP-growth process, however, memory overflow could be predicted in advance. Subsequently, before memory overflow occurs, the mining method could be changed to secondary memory-based mining. Alternatively, it could simply wait for the main memory to be released and proceed with the main memory-based mining method.
There are several FP-growth methods based on the secondary memory: DRFP-growth [26], BFP-growth [27], and SQL-based FP-growth [28]. Previously, we also developed a novel secondary memory-based FP-growth method called the push-and-pop FP algorithm (PPFP) [29]. These secondary memory-based FP-growth methods have improved scalability. As access to secondary storage systems is computationally expensive, however, they have limited time efficiency. On the other hand, because they rarely consume the main memory space, when the memory is insufficient to mine with the main memory-based FP-growth, it is possible to continue the mining process using these secondary memory-based FP-growth methods.
In addition, predicting the size of the next task can help in solving the load balancing problem in distributed FP-growth processing. As observed previously for various FP-growth methods, it is difficult to apply FP-growth algorithms to big data in a single process because of time and memory efficiency problems [23, 24, 30]. Therefore, research dedicated to the parallel processing of FP-growth, including FP mining of big data, has been conducted. For example, the parallel FP-growth (PFP) algorithm proposed in [25] is an FP-growth algorithm based on the MapReduce method [31-33]. PFP is a classic parallel form of FP-growth that processes mining by dividing the work into independent parallel streams. Other studies have improved PFP on Spark and its applications [34, 35]. Nonetheless, the parallel processing of FP-growth is still associated with problems such as load imbalance [36]. In parallel and distributed processing of FP-growth, the FP-growth process is divided into small tasks and simultaneously processed by multiple workstations. A load balancing problem occurs if the tasks cannot be allocated properly to each workstation. For example, if memory overflow occurs because a task is allocated in a size that cannot be processed in the memory of a specific workstation, an error recovery cost is incurred. As another example, when large tasks are concentrated on a specific workstation, its processing time is longer than that of other workstations; therefore, the total processing time could take longer. The next task size prediction method proposed in this paper can help solve the load balancing problem by allocating to each workstation a task sized to its remaining memory or by assigning appropriately sized tasks to the workstations.
Therefore, in this paper, a method of predicting the size of the next task during the FP-growth process is proposed. Using this method enables predicting the memory size needed for the next task during the FP-growth process.
The rest of this paper is organized as follows: Section 2 provides an overview of related works, focusing on FP-growth and PPFP; Section 3 introduces the next task size prediction method; Section 4 discusses the conducted performance study and presents the results of the analytical comparison using the incorrect prediction ratio; Section 5 summarizes the results of this study and outlines several future research issues.


Related Work

This section describes the original FP-growth method and FP-growth algorithm based on the secondary memory (PPFP) proposed in previous studies.


FP-Growth

The FP-growth algorithm performs FP mining in two stages. First, it generates an FP-tree by compressing frequent items from the transaction database using the tree data structure to enable efficient search. Second, FP mining is performed using an FP-tree.
At the first stage of FP-growth, the data are stored in the FP-tree’s expanded prefix tree structure, which compresses and stores the required information derived from the data in the tree structure by performing only two database scans. It determines the priority of items by calculating their frequency during the first scan, and arranges each set of items according to the priority and builds an FP-tree during the second scan.
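The two database scans described above can be sketched as follows; the transactions and the minimum support threshold here are hypothetical:

```python
from collections import Counter

transactions = [["a", "b", "d"], ["b", "c"], ["a", "b", "d"], ["a", "c"]]
min_support = 2  # hypothetical minimum support threshold

# First scan: count item frequencies to determine item priorities.
freq = Counter(item for t in transactions for item in t)
priority = {item: n for item, n in freq.items() if n >= min_support}

# Second scan: keep only frequent items and arrange each transaction by
# priority (most frequent first) before inserting it into the FP-tree.
ordered = [
    sorted((i for i in t if i in priority), key=lambda i: (-priority[i], i))
    for t in transactions
]
```

For instance, the first transaction stays ordered as ['a', 'b', 'd'] because a and b are more frequent than d.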
After generating an FP-tree from the transaction database, FPs are mined by applying FP-growth, which proceeds from the bottom-most item in the FP-tree to the topmost one. The method is defined as follows: first, the algorithm obtains the path from the root to each node by examining the nodes since items in the header table are connected by links, prepares the conditional pattern base (CPB), and generates a CFP-tree just like when building an FP-tree; second, it gathers the paths from the CFP-tree generated for a node, prepares the CPB again, and generates a CFP-tree. Recursively, it extracts an FP generating the CFP-tree. The algorithm of FP-growth is provided in Pseudo Code 1.

Pseudo Code 1. FP-growth

Algorithm1 FP-growth(FP-tree, α)
1. if FP-tree contains a single path P
2.   for each combination β of the nodes on the path P do
3.     Generate pattern β∪α with support = minimum support of the nodes in β
4.   end for
5. else
6.   for each item ai in the header table of FP-tree do
7.     Generate pattern β = ai∪α with support = ai.support
8.     Construct β’s conditional pattern base and then β’s conditional FP-tree CFP-tree
9.     if CFP-tree ≠ Ø then
10.       Call FP-growth(CFP-tree, β)
11.     end if
12.   end for
13. end if
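For the single-path case (lines 1-4 of Pseudo Code 1), the combinations β and their supports can be enumerated directly; a minimal sketch with a hypothetical path and suffix pattern:

```python
from itertools import combinations

# Hypothetical single path P as (item, count) pairs, root excluded,
# and a suffix pattern alpha mined so far.
path = [("a", 4), ("b", 3), ("c", 2)]
alpha = ("f",)

patterns = {}
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        items = tuple(i for i, _ in combo) + alpha
        # support = minimum count among the nodes chosen for beta
        patterns[items] = min(count for _, count in combo)
```

This yields 2^3 - 1 = 7 patterns, e.g. ('a', 'b', 'c', 'f') with support 2.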
FP-growth is one of the most actively researched and widely used methods in the field. Since this method depends on memory capacity and maintains the repeatedly generated CFP-trees in memory, however, it is difficult to mine a large transaction database in practice due to the limited physical memory space.

PPFP Mining Method
In addition to the FP-growth algorithm, which mines FPs by generating recursive CFP-trees, the PPFP algorithm mines FPs sequentially using the push-and-pop method with a single stack. To mine FPs with the PPFP algorithm, the FP-tree data structure is adopted.
Pseudo Code 2 describes the PPFP algorithm in detail.

Pseudo Code 2. PPFP algorithm
Algorithm 2 PPFP
Input: FP-tree constructed from the transaction database, minimum support threshold ξ
Output: FP table
Method: Call PPFP(), which is implemented as follows.
Procedure PPFP()
1.   make CFP-stack /* CFP-stack: conditional pattern stack */
2.   insert {frequent item set} into CFP-stack
3.   α = POP one pattern from CFP-stack
4.   if α ≠ Ø
5.     call pushAndPop(α)

Procedure pushAndPop(α)
1.   Insert α into FP table /* FP table collects the resulting FPs */
2.   β = collect α’s conditional pattern base
3.   if β ≠ Ø
4.     PUSH β into CFP-stack
5.   α = POP one pattern from CFP-stack
6.   if α ≠ Ø
7.     call pushAndPop(α)
As an FP-growth method based on secondary memory, PPFP enhances scalability, unlike main memory-dependent algorithms. Moreover, it is more time-efficient than the previous secondary memory-based FP-growth methods, though still more time-consuming than the main memory-based methods.
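The push-and-pop control flow can be sketched with an explicit stack replacing FP-growth's recursion; the conditional-pattern-base table below is a hypothetical stub, and a real PPFP keeps the stack in secondary memory:

```python
# Hypothetical stub: which conditional patterns each popped pattern spawns.
cpb = {
    ("d",): [("c", "d"), ("b", "d")],
    ("c", "d"): [("b", "c", "d")],
}

stack = [("d",)]   # CFP-stack seeded with a frequent item
fp_table = []      # collects the resulting frequent patterns

while stack:                            # POP until the stack is empty
    alpha = stack.pop()
    fp_table.append(alpha)              # insert alpha into the FP table
    stack.extend(cpb.get(alpha, []))    # PUSH alpha's conditional patterns
```

The explicit stack makes the recursion depth independent of the call stack, which is what allows PPFP to move the working set out of main memory.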


Next Task Size Prediction Method for FP-Growth Process

This section describes the proposed next task size prediction method during FP-growth.
The FP-growth algorithm repeatedly generates CFP-trees and mines FPs. Therefore, one CFP-tree generation is regarded as one task. To predict the next task size, the size of the CFP-tree to be generated is estimated. The CFP-tree consists of nodes and links. Each node must have the item name and the frequency of the item at that node, and it links to the node with the same item name in the FP-tree. In the FP-growth process, the FP-tree is explored in a bottom-up manner, generating a CPB; therefore, a node must also have a link to its parent node. As such, each node should have an item name, a frequency, a link to the next node with the same item name, and a link to the parent node. Nonetheless, the composition and size of this information vary depending on how the FP-tree is implemented. Knowing the size of a node and the number of nodes allows predicting the size of the entire tree. Therefore, this paper presents a method of predicting the number of nodes, so that the node size of a given implementation can be applied to obtain the predicted memory size required.
The proposed next task size prediction method is discussed in three steps: first, predict the maximum number of nodes (maxNC) in a CFP-tree to be generated; then, adjust the maxNC by reflecting the ratio of the actually generated node count (ANC); finally, describe how to apply the proposed node-count prediction method to FP-growth.

MaxNC: Prediction of Maximum Node Count Generated Based on N-items
Determining the maxNC of a tree to be generated with given items requires knowing when the tree contains the maximum number of nodes. When the FP-tree expresses all the patterns that can be generated with the given items, the maximum node count is reached. An FP-tree with the maximum node count is denoted as the max FP-tree (maximal FP-tree).
The tree structure shown in Fig. 1 is the max FP-tree using itemset {a, b, c, d}. The item set is assumed to be sorted as {a, b, c, d} such that a has the top priority and d has the lowest priority.
Fig. 1. Sample of the max FP-tree.


As shown in Fig. 1, among the child nodes in the root, the sub-tree of the node with the highest priority has the same structure as all remaining trees. In Fig. 1, node a’s sub-tree is the same as X3, which is the set of the remaining trees except node a’s sub-tree. The sub-tree of node b has X2 as a child tree.
If there is an item with higher priority than item a, X4 is added by copying it under the node corresponding to the newly added item under the root. In the max FP-tree, the structure is repeated such that a sub-tree of one item with high priority inherits the entire sub-tree model of items with low priority.
Based on this rule, the maximum node count that can be generated is 1 with one item, 2+1 with two items, 4+(2+1) with three items, and 8+(4+(2+1)) with four items. Accordingly, when the FP-tree has n items, the maximum node count (maxNC) that can be generated is ∑_{x=1}^{n} 2^(x-1) non-root nodes plus the root node; that is, ∑_{x=1}^{n} 2^(x-1) + 1 = 2^n.
Equation (1) presents the formula for maxNC:

maxNC = ∑_{x=1}^{n} 2^(x-1) + 1 = 2^n        (1)
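Equation (1) can be checked numerically; a minimal sketch:

```python
def max_nc(n):
    # maxNC for n items: all non-root nodes plus the root node
    return sum(2 ** (x - 1) for x in range(1, n + 1)) + 1

# The closed form is 2^n for every item count n.
assert all(max_nc(n) == 2 ** n for n in range(1, 16))
```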

Because the node count can increase exponentially whenever the number of items is augmented, maintaining the cumulative history information of the CFP-trees is computationally infeasible in a system with limited memory resources.
Fig. 2 shows the accumulated CFP-trees when FP-growth progresses with an FP-tree depicted in Fig. 1. To obtain the maximum number of nodes that can be theoretically generated, all CFP-trees are assumed to be generated by using a max CFP-tree, and the CFP-tree is generated even when there is only one item in the CPB.
Fig. 2. The total node count in the memory during an FP-growth process.


When the item set is {a, b, c, d}, the maxNC of the FP-tree is 2^4, as shown in Fig. 2(1). Therefore, the total node count in the memory is 2^4. As the process proceeds in a bottom-up fashion, it initiates mining from item d, which has the lowest priority. The ancestors of item d are {a, b, c}. Among them, the item with the lowest priority is c. Therefore, d’s max CFP-tree is generated as shown in Fig. 2(2). As all CFP-trees are assumed to be generated as maximal trees, the node count of d’s CFP-tree is 2^3. Here, the node count in the memory is 2^4 + 2^3. The lowest priority item in d’s max CFP-tree is c, which has two ancestor items {a, b}. Therefore, cd’s max CFP-tree is generated as outlined in Fig. 2(3). The lowest item of cd’s max CFP-tree has ancestor a. The CFP-tree is assumed to be generated even when there is only one item in the CPB. Therefore, bcd’s CFP-tree is generated as shown in Fig. 2(4). At this time, the node count in the memory, 2^4 + 2^3 + 2^2 + 2^1, is the maximum node count accumulated in memory when FP-growth is applied to items {a, b, c, d}.
This way, the “abcd” pattern is saved in the FP table, and the mining process on bcd is executed. After mining of the pattern bcd is completed, bcd’s CFP-tree is deleted from the memory. The mining process then returns to the cd’s next item. As a result, FPs are mined by repeatedly generating and deleting the CFP-tree, with the total node count in the memory changed according to the creation and deletion of a CFP-tree.
The following steps correspond to the method of finding the maximum node count accumulated in memory when FP-growth is performed on n items:

- The total node count of the max FP-tree (maxNC): 2^n
- The node count of the first max CFP-tree: 2^(n-1)
- The node count of the second max CFP-tree: 2^(n-2)
- ……
- The node count of the (n-2)th max CFP-tree: 2^(n-(n-2))
- The node count of the (n-1)th max CFP-tree: 2^(n-(n-1))
- The node count of the nth max CFP-tree: 2^(n-n)
- Maximum accumulated node count: 2^n + ∑_{x=1}^{n} 2^(n-x)
The maximum node count of an FP-tree with n items is 2^n. The item set is denoted as I = {i_1, i_2, i_3, …, i_(n-1), i_n} and arranged according to frequency. Here, i_1 has the highest priority, and i_n has the lowest priority. To generate the maximum accumulated node count in memory during the FP-growth process, the CFP-tree of the lowest priority item is assumed to be created as the max CFP-tree with the items on its upper path.
In the case of item i_n, the maximum node count of i_n’s max CFP-tree is 2^(n-1). At this time, the accumulated node count in the memory is 2^n + 2^(n-1). The item with the lowest priority is then searched in i_n’s max CFP-tree. Since i_(n-1) is the lowest priority item in i_n’s max CFP-tree, the max CFP-tree generated with i_(n-1) is denoted as i_(n-1)i_n’s max CFP-tree. The node count of i_(n-1)i_n’s max CFP-tree is 2^(n-2). In such a case, the accumulated node count in the memory is 2^n + 2^(n-1) + 2^(n-2). If the max CFP-tree for the lowest-priority item is repeatedly generated and accumulated in the memory, the maximum accumulated node count in the memory becomes 2^n + ∑_{x=1}^{n} 2^(n-x).
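The accumulation described above can be checked numerically; a minimal sketch (the helper name is illustrative):

```python
def accumulated_max_nodes(n):
    # main FP-tree (2^n) plus one max CFP-tree per step: 2^(n-1), ..., 2^0
    return 2 ** n + sum(2 ** (n - x) for x in range(1, n + 1))

# The closed form of the worst-case accumulation is 2^(n+1) - 1.
assert all(accumulated_max_nodes(n) == 2 ** (n + 1) - 1 for n in range(1, 16))
```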

PNC: Prediction of the Node Count
It is necessary to predict the node count of the next generated CFP-tree to allocate the required memory space in the next step or distribute the load appropriately. In this section, an equation for predicting the node count in a CFP-tree to be generated in the next step is proposed.
The worst case of memory usage in FP-growth with n items occurs when the max CFP-tree for the lowest priority item is generated continuously. In this case, the predictable total node count is 2^n + ∑_{x=1}^{n} 2^(n-x). This is, however, only the maximum node count that can be generated theoretically, and most CFP-trees have node counts lower than that of the max FP-tree. Moreover, many items are deleted while the trees are generated, according to the minimum support threshold, reliability, etc.
To reflect the ANC in the node count prediction equation (PNC), the rate of the ANC to the theoretical maxNC is used. This node count rate is denoted as NCR, and its equation is presented in (2):

NCR = ANC / ⌈maxNC × NCRparentTree⌉        (2)

(the main FP-tree’s NCRparentTree = 1)


To reflect the NCR of previously generated trees, the denominator of the equation multiplies maxNC by the parent tree’s NCR value. The denominator of the NCR equation has to be less than or equal to maxNC, since the denominator corresponds to the predicted node count, and the PNC cannot be greater than the maximum node count. If the parent tree’s NCR were greater than 1, the child tree’s NCR denominator would exceed the maximum node count; therefore, the maximum value of NCR is limited to 1. Consequently, NCR has a value greater than 0 but less than or equal to 1. When NCR is close to 1, the prediction is accurate; when NCR is close to 0, the prediction is inaccurate. An NCR value of 1 means that the predicted node count and the actually generated one are the same. The value obtained by subtracting NCR from 1 is denoted as the incorrect prediction ratio (IPR). After calculating NCR, it is possible to predict the node count of the next generated CFP-tree.
To estimate the node count of the next CFP-tree to be generated, the maxNC and NCR equations are applied. The equation for obtaining the predicted node count of a CFP-tree is denoted as PNC and is shown in Equation (3):

PNC = ⌈maxNC × NCRparentTree⌉        (3)

PNC is the predicted node count. Since a node count must be a natural number, the fractional result of maxNC × NCRparentTree has to be rounded to a natural number. Rounding to the nearest integer is ambiguous: for fractional parts greater than 0.5 the IPR can be higher, and for fractional parts lower than 0.5 it can be lower. Rounding down is also unsuitable, because discarding the fractional part makes NCR frequently become larger than 1. Therefore, even if the IPR is somewhat higher, the value is rounded up to improve accuracy.
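The effect of the rounding rule can be seen on a small example; the maxNC and NCR values below are hypothetical:

```python
import math

max_nc = 2 ** 4       # hypothetical CFP-tree with 4 frequent items
ncr_parent = 0.27     # hypothetical NCR inherited from the parent tree
raw = max_nc * ncr_parent   # 4.32, not a natural number

pnc_up = math.ceil(raw)     # 5: the rounding rule adopted in this paper
pnc_down = math.floor(raw)  # 4: an ANC of 5 would then give NCR = 1.25 > 1
```

Rounding up guarantees the denominator of NCR never underestimates the prediction, which is what keeps NCR within (0, 1] in the common case.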

Implementation of PNC
Fig. 3 presents an example of PNC implementation. First, find the NCR of the main FP-tree. Fig. 3(1) shows the structure of the main FP-tree. Since the item count of the main FP-tree is 6, the main FP-tree’s maxNC is 2^6. The main FP-tree’s NCRparentTree is 1. The ANC of the main FP-tree is 22. Therefore, the NCR of the main FP-tree is ANC/⌈maxNC × NCRparentTree⌉ = 22/⌈2^6 × 1⌉ = 22/64 = 0.34375.
Fig. 3. Simple example for implementation of PNC (minimum support threshold ξ = 2).


Before generating the first CFP-tree, calculate the PNC of the first CFP-tree. The lowest priority item in the main FP-tree’s header table is f. Item f’s CPB is {abd:2, bd:1, d:1}, and its frequent items are {d:4, a:2, b:2}. With 3 items, f’s maxNC is 2^3. Consequently, f’s PNC is ⌈f’s maxNC (2^3) × main FP-tree’s NCR (0.34375)⌉ = ⌈2.75⌉ = 3. Then, generate f’s CFP-tree. The generated node count of f’s CFP-tree is 4. Item f’s CFP-tree has only a single path. In the original FP-growth, when the CFP-tree has a single path, no further CFP-trees are generated; therefore, PNC is not calculated anymore. Then, return to the main FP-tree’s header table. The lowest priority item among the remaining items is e. Here, e’s CPB is {acd:1, ac:2, ad:1, a:1, cd:1, c:1} as shown in Fig. 3(3); e has 3 frequent items {a:5, c:5, d:3}. Therefore, e’s maxNC is 2^3, and e’s PNC is ⌈2^3 × 0.34375⌉ = ⌈2.75⌉ = 3. Then, generate e’s CFP-tree, as shown in Fig. 3(4). The generated node count of e’s CFP-tree is 8. Therefore, e’s NCR is e’s ANC (8) / e’s PNC (3) = 2.67, which is greater than 1 and thus capped at 1. As e’s CFP-tree has multiple paths, search for the lowest item’s PNC in e’s header table. The lowest item in e’s header table is d. Fig. 3(5) shows how to identify de’s PNC.
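The walkthrough above can be reproduced numerically; the helper functions are a sketch, not part of the original implementation:

```python
import math

def pnc(max_nc, ncr_parent):
    # predicted node count of the next CFP-tree, rounded up
    return math.ceil(max_nc * ncr_parent)

def ncr(anc, predicted):
    # actual-to-predicted node count ratio, capped at 1
    return min(anc / predicted, 1.0)

ncr_main = ncr(anc=22, predicted=pnc(2 ** 6, 1))  # 22/64 = 0.34375
pnc_f = pnc(2 ** 3, ncr_main)                     # ceil(2.75) = 3
pnc_e = pnc(2 ** 3, ncr_main)                     # 3
ncr_e = ncr(anc=8, predicted=pnc_e)               # 8/3 > 1, capped to 1.0
```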
Following this approach, the node count of the next generated CFP-tree can be predicted. Pseudo Code 3 presents the PNC implementation algorithm, and Fig. 4 presents a flowchart of the work performed by the PNC implementation.

Pseudo Code 3. PNC Implementation
Algorithm 3 PNC_FP-growth(FP-tree, α, ncr)
1. if FP-tree contains a single path P
2.   for each combination β of the nodes on the path P do
3.     Generate pattern β∪α with support = minimum support of the nodes in β
4.   end for
5. else
6.   for each item ai in the header table of FP-tree do
7.     Generate pattern β = ai∪α with support = ai.support
8.     Construct β’s conditional pattern base
9.     β’s maxNC = 2^(number of β’s frequent items)
10.     β’s pnc = ⌈β’s maxNC × ncr⌉
11.     Construct β’s conditional FP-tree CFP-tree
12.     β’s anc= generated node count of β’s CFP-tree
13.     β’s ncr = β’s anc / β’s pnc
14.     if (β’s ncr > 1)
15.       β’s ncr = 1
16.     if CFP-tree ≠ Ø then
17.       Call PNC_FP-growth(CFP-tree, β, β’s ncr)
18.     end if
19.   end for
20. end if
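The NCR bookkeeping of Algorithm 3 (lines 9-15) can be traced in isolation; a sketch with hypothetical inputs:

```python
import math

def propagate_ncr(branch, ncr=1.0):
    """Trace PNC/NCR updates along one branch of generated CFP-trees.

    branch is a hypothetical list of (frequent_item_count, actual_node_count)
    pairs, one per CFP-tree generated in sequence.
    """
    steps = []
    for n_items, anc in branch:
        max_nc = 2 ** n_items          # line 9: maxNC = 2^(frequent items)
        pnc = math.ceil(max_nc * ncr)  # line 10: predicted node count
        ncr = min(anc / pnc, 1.0)      # lines 13-15: NCR capped at 1
        steps.append((pnc, ncr))
    return steps
```

For example, starting from the main FP-tree of Fig. 3, propagate_ncr([(6, 22), (3, 8)]) predicts 64 and then 3 nodes.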
Fig. 4. Flowchart indicating the performed work of PNC implementation.


Results

In this section, the experiment results of the next task size prediction method proposed in this study are presented. All experiments were conducted on a PC running 32-bit Windows 7 with an Intel i5-3210M 2.5 GHz CPU and 2 GB of DDR3 RAM. To test the validity of the results of this study, three datasets denoted as T10I4D1K, T20I4D1K, and T30I4D1K were prepared using the IBM Quest market basket synthetic data generator.
In TXXIYYDZZ, XX meant the average transaction length, YY was the average maximum potential frequent itemset size, and ZZ denoted the total number of transactions. The output was written into file TXXIYYDZZ. Each line of the file corresponded to a transaction. The items in each transaction were represented by item numbers and separated by spaces.
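The naming scheme can be decoded mechanically; a small sketch (the helper name is illustrative):

```python
import re

def parse_dataset_name(name):
    # TXXIYYDZZ: average transaction length (XX), average maximum potential
    # frequent itemset size (YY), and total number of transactions (ZZ).
    m = re.fullmatch(r"T(\d+)I(\d+)D(\w+)", name)
    return {
        "avg_transaction_length": int(m.group(1)),
        "avg_pattern_size": int(m.group(2)),
        "num_transactions": m.group(3),
    }
```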
To evaluate the accuracy of the suggested equation and analyze how its accuracy varied according to the size of the tree, three datasets were generated and evaluated. The considered datasets, T10I4D1K, T20I4D1K, and T30I4D1K, followed the same data generation rule under the same conditions except for the average transaction length. They had an average maximum potential frequent itemset size of 4, and the total number of transactions was 1K. The average transaction lengths differed, i.e., 10, 20, and 30; an increase made the tree deeper.
Three experiments were performed. To facilitate the visibility of the experiment results, the IPR, the complement of the node count prediction accuracy ratio, was calculated as 1 - NCR. In the first experiment, the average NCR during FP-growth for each dataset was calculated; then, the IPR obtained with the main FP-tree’s NCR and the IPR obtained with the NCR of the previously generated CFP-tree were compared. The second experiment checked the sequential IPR during CFP-tree generation for the dataset T30I4D1K; here, NCRparentTree was sampled until the 3rd depth, after which the 3rd NCR value was applied. In the third experiment, the NCRparentTree values were computed until the last depth and used to compute NCR. To ensure accurate test results, the FP-tree pruning function was disabled so that no node count was omitted.
Fig. 5 presents the results of comparing the main FP-tree NCR (NCRmainTree) and previously generated CFP-tree NCR (NCRparentTree). First, the main FP-tree maxNC and ANC were obtained. Then, NCR was computed accordingly. Next, NCRmainTree was calculated during FP-growth to obtain the NCR of the CFP-trees. The size of the main FP-tree continuously affected the sizes of the CFP-trees obtained from the main FP-tree. Therefore, NCRmainTree alone could be used to predict the node count of CFP-trees to some extent.

Fig. 5. Performance comparison when applied to NCRparentTree and NCRmainTree.


When NCRparentTree was applied, the node count could be predicted more precisely. In this case, the CFP-tree NCR was computed using NCRparentTree. Therefore, whenever the CFP-tree was generated repeatedly, NCRparentTree was applied cumulatively. Consequently, the node count could be predicted more accurately.
As shown in Fig. 5, when NCRparentTree was applied, the IPR was much smaller compared to the case wherein NCRmainTree was applied. Therefore, by using the NCR equation proposed in this study, the IPR could be adjusted considerably as shown in the graph in Fig. 5.
The T30I4D1K dataset had longer transactions than T10I4D1K. As a result, the main FP-tree corresponding to this dataset was deeper than that of T10I4D1K. In Fig. 5, it can be seen that, when the transaction length increased, the IPR declined. When the depth of the main FP-tree was larger, there were more chances to obtain NCRparentTree for the CFP-trees derived from the main FP-tree. Therefore, the IPR was reduced.
In the second experiment, as shown in Fig. 6, the CFP-tree NCR was computed during the FP-growth process based on the T30I4D1K dataset. NCRparentTree was sampled until the 3rd depth; afterward, the 3rd CFP-tree NCR value was applied. Fig. 6 shows the change in the IPR during the FP-growth process. As shown in Fig. 6, the IPR demonstrated a constant pattern: it had the highest value when the first CFP-tree was derived from the main FP-tree but was reduced when CFP-trees were generated again from the parent CFP-tree.
In other words, when CFP-trees with larger depths and frequencies were generated, the PNC prediction equation could achieve higher accuracy. In addition, no IPRs exceeded those of the main FP-tree. In other words, the error between the total node count generated while the FP-growth was executed on T30I4D1K and the one expected to be generated through the PNC prediction equation occurred within the maximum threshold of 0.4%. This indicates that the PNC equation’s prediction accuracy could be considered appropriate in predicting the total node count that could be generated within the meaningful error range in analyzing a dataset of a large scale.
Fig. 6. Sequential IPR (1-NCR) during generation of CFP-trees (T30I4D1K) with NCRparentTree until 3rd depth.


In the second experiment, computations were performed only until the 3rd depth of NCRparentTree; therefore, after the 3rd CFP-tree, the 3rd NCRparentTree value was applied. In the third experiment, however, the NCRparentTree values were computed until the last depth and used to compute NCR. Compared with Fig. 6, Fig. 7 shows more blank spaces at the points where the IPR is 0, indicating that the predicted node count and the actually generated one are almost equal. The case wherein NCRparentTree was applied until the last depth had higher node count prediction accuracy than the case wherein NCRparentTree was computed only until the 3rd depth.
When a CFP-tree was generated during FP-growth and the last depth of the CFP-tree was reached, the process returned to the previous depth. Each time the process returned to the previous depth and a CFP-tree was generated again, that CFP-tree’s IPR increased.
Fig. 7. Sequential IPR (1-NCR) during the generation of CFP-trees (T30I4D1K).


Conclusion

The experiment results confirm that the proposed next task prediction method achieves high prediction accuracy. In the case of the T30I4D1K dataset, the error between the total node count generated and the total node count expected to be generated using the PNC prediction equation is within a maximum of 0.4%. This indicates that the PNC equation’s prediction accuracy is appropriate for predicting the total node count that can be generated within the reasonable error range in analyzing data of a large scale. Moreover, prediction accuracy is shown to have a tendency to increase when the depth of a tree increases.
Our next task is to study how to apply PNC.
First, a hybrid FP-growth method is being studied so that FP mining can proceed in case of insufficient memory during FP-growth. To improve FP-growth scalability, we previously proposed an FP-growth method based on secondary memory. Although this method improves scalability, it is less time-efficient than the original FP-growth algorithm. Therefore, we will focus on developing a hybrid FP-growth method that applies the original FP-growth method to mine FPs if the main memory is sufficient, or the secondary memory-based method if the main memory is insufficient. Using the PNC described in this paper, it can predict memory shortage before a memory overflow occurs and switch to the secondary memory-based FP-growth method accordingly.
Next, the application of the distributed and parallel processing of the FP-growth will be studied. In the case of big data FP mining, it is difficult to perform data mining through a single process. Accordingly, many studies on distributed parallel processing for FP-growth have been conducted. Nonetheless, the load balancing problem may still occur concerning distributed processing. Using the next task size prediction method proposed in this paper can mitigate this problem by predicting the size of the next task and establishing efficient task allocation. After completing the research on the hybrid FP-growth method without memory overflows, we plan to develop a method of allocating tasks efficiently by applying PNC to the task allocation in distributed parallel FP-growth.


Acknowledgements

The author would like to thank Mi-Jin Lee (Dongguk University) for discussing the issues related to frequent pattern mining.


Author’s Contributions

The author performed the design and implementation of the proposed method and read and approved the final manuscript.


Funding

None.


Competing Interests

The author declares no competing interests.


Availability of Data and Materials

The datasets generated and analyzed during the current study are available in the IBM data generator, https://github.com/zakimjz/IBMGenerator.


Author Biography

Name : Jeong-Hoon Lee
Affiliation : Department of Liberal Arts & Science, Gachon University, Seongnam, Korea
Biography :
2005 Bachelor of Computer Engineering from Dongguk University in Seoul, Korea
2007 Master of Computer Engineering from Dongguk University in Seoul, Korea
2011 Ph.D. of Computer Engineering from Dongguk University in Seoul, Korea


References

[1] R. Agrawal, T. Imieliski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, 1993, pp. 207-216.
[2] J. Han, J. Pei, and Y. Yin. “Mining frequent patterns without candidate generation,” in Proceedings of the ACM SIGMOD Conference on Management of Data, Dallas, TX, 2000.
[3] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, 1994, pp. 487-499.
[4] S. Mallik, T. Bhadra, and A. Mukherji, “DTFP-Growth: dynamic threshold-based FP-Growth rule mining algorithm through integrating gene expression, methylation, and protein–protein interaction profiles,” IEEE Transactions on Nanobioscience, vol. 17, no. 2, pp. 117-125, 2018.
[5] C. Zhang, Z. Li, T. Li, Y. Han, C. Wei, Y. Cheng, and Y. Peng, “P-CSREC: a new approach for personalized cloud service recommendation,” IEEE Access, vol. 6, pp. 35946-35956, 2018.
[6] Q. Yang, Q. Zhang, Y. Zhang, and F. Du, “Application research of parallel frequent itemset mining algorithm in book recommendation,” in Proceedings of 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), Xiamen, China, 2019, pp. 647-652.
[7] K. Dahdouh, A. Dakkak, L. Oughdir, and A. Ibriz, “Large-scale e-learning recommender system based on Spark and Hadoop,” Journal of Big Data, vo. 6, article no. 2, 2019. https://doi.org/10.1186/s40537-019-0169-4
[8] F. Liu, Y. Su, T. Wang, J. Fu, S. Chen, and C. Ju, “Research on FP-Growth algorithm for agricultural major courses recommendation in China Open University system,” in Proceedings of 2019 12th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 2019, pp. 167-170.
[9] W. N. Ismail, M. M. Hassan, and H. A. Alsalamah, “Context-enriched regular human behavioral pattern detection from body sensors data,” IEEE Access, vol. 7, pp. 33834-33850, 2019.
[10] A. Kazemi, A. Keshtkar, S. Rashidi, N. Aslanabadi, B. Khodadad, and M. Esmaeili, “Segmentation of cardiac fats based on Gabor filters and relationship of adipose volume with coronary artery disease using FP-Growth algorithm in CT scans,” Biomedical Physics & Engineering Express, vol. 6, no. 5, article no. 055009, 2020. https://doi.org/10.1088/2057-1976/aba441
[11] K. Davagdorj and K. H. Ryu, “Association rule mining on head and neck squamous cell carcinoma cancer using FP Growth algorithm,” in Proceedings of the International Conference on Information, System and Convergence Applications, Bangkok, Thailand, 2018.
[12] L. Ardiantoro and N. Sunarmi, “Badminton player scouting analysis using Frequent Pattern growth (FP-growth) algorithm,” Journal of Physics: Conference Series, vol. 1456, article no. 012023, 2020. https://doi.org/10.1088/1742-6596/1456/1/012023
[13] Y. Jiang, M. Zhao, C. Hu, L. He, H. Bai, and J. Wang, “A parallel FP-growth algorithm on World Ocean Atlas data with multi-core CPU,” The Journal of Supercomputing, vol. 75, no. 2, pp. 732-745, 2019.
[14] A. Shaban, F. Almasalha, and M. H. Qutqut, “Hybrid user action prediction system for automated home using association rules and ontology,” IET Wireless Sensor Systems, vol. 9, no. 2, pp. 85-93, 2019.
[15] M. Tang, Y. Xia, B. Tang, Y. Zhou, B. Cao, and R. Hu, “Mining collaboration patterns between APIs for mashup creation in Web of Things,” IEEE Access, vol. 7, pp. 14206-14215, 2019.
[16] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu, “Mining sequential patterns by pattern-growth: the prefixspan approach,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1424-1440, 2004.
[17] G. Grahne and J. Zhu, “Efficiently using prefix-trees in mining frequent itemsets,” in Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, Melbourne, FL, 2003.
[18] C. K. S. Leung and D. A. Brajczuk, “Efficient algorithms for mining constrained frequent patterns from uncertain data,” in Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, Paris, France, 2009, pp. 9-18.
[19] Y. Hirate, E. Iwahashi, and H. Yamana, “TF2P-growth: an efficient algorithm for mining frequent patterns without any thresholds,” in Proceedings of IEEE ICDM 2004 Workshop on Alternative Techniques for Data Mining and Knowledge Discovery, Brighton, UK, 2004.
[20] A. W. C. Fu, R. W. W. Kwong, and J. Tang, “Mining n-most interesting itemsets,” in Foundation Intelligent Systems. Heidelberg, Germany: Springer, 2000, pp. 59-67.
[21] J. Han, J. Wang, Y. Lu, and P. Tzvetkov, “Mining top-k frequent closed patterns without minimum support,” in Proceedings of 2002 IEEE International Conference on Data Mining, Maebashi City, Japan, 2002, pp. 211-218.
[22] Y. Zhang, P. Luo, and L. Qu, “An improved differential privacy algorithm using frequent pattern mining,” in Proceedings of 2019 15th International Conference on Computational Intelligence and Security (CIS), Macao, China, 2019, pp. 419-423.
[23] B. Goethals, “Memory issues in frequent itemset mining,” in Proceedings of the 2004 ACM Symposium on Applied Computing, Nicosia, Cyprus, 2004, pp. 530-534.
[24] R. Vaarandi, “A breadth-first algorithm for mining frequent patterns from event logs,” in Intelligence in Communication Systems. Heidelberg, Germany: Springer, 2004, pp. 293-308.
[25] G. Buehrer, S. Parthasarathy, and A. Ghoting, “Out-of-core frequent pattern mining on a commodity PC,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, 2006, pp. 86-95.
[26] G. Buehrer, S. Parthasarathy, and A. Ghoting, “Out-of-core frequent pattern mining on a commodity PC,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, 2006, pp. 86-95.
[27] M. Adnan and R. Alhajj, “A bounded and adaptive memory-based approach to mine frequent patterns from very large databases,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 1, pp. 154-172, 2011.
[28] X. Shang, K. U. Sattler, and I. Geist, “SQL based frequent pattern mining with FP-growth,” in Applications of Declarative Programming and Knowledge Management. Heidelberg, Germany: Springer, 2004, pp. 32-46.
[29] J. H. Lee and Y. A. Min, “PPFP (Push and Pop Frequent Pattern Mining): a novel frequent pattern mining method for bigdata frequent pattern mining,” KIPS Transactions on Software and Data Engineering, vol. 5, no. 12, pp. 623-634, 2016.
[30] K. Wang, L. Tang, J. Han, and J. Liu, “Top down FP-growth for association rule mining,” in Advances in Knowledge Discovery and Data Mining. Heidelberg, Germany: Springer, 2002, pp. 334-340.
[31] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, “PFP: parallel FP-growth for query recommendation,” in Proceedings of the 2008 ACM Conference on Recommender Systems, Lausanne, Switzerland, 2008, pp. 107-114.
[32] A. Segatori, A. Bechini, P. Ducange, and F. Marcelloni, “A distributed fuzzy associative classifier for big data,” IEEE Transactions on Cybernetics, vol. 48, no. 9, pp. 2656-2669, 2018.
[33] R. A. Deshmukh, H. N. Bharathi, and A. K. Tripathy, “parallel processing of frequent itemset based on MapReduce programming model,” in Proceedings of 2019 5th International Conference On Computing, Communication, Control And Automation (ICCUBEA), Pune, India, 2019, pp. 1-6.
[34] Y. Miao, J. Lin, and N. Xu, “An improved parallel FP-growth algorithm based on Spark and its application,” in Proceedings of 2019 Chinese Control Conference (CCC), Guangzhou, China, 2019, pp. 3793-3797.
[35] J. Ragaventhiran and M. K. Kavithadevi, “Map-optimize-reduce: CAN tree assisted FP-growth algorithm for clusters based FP mining on Hadoop,” Future Generation Computer Systems, vol. 103, pp. 111-122, 2020.
[36] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” in Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI), San Francisco, CA, 2004, pp. 137-150.

About this article
Cite this article

Jeong-Hoon Lee, “Next Task Size Prediction Method for FP-Growth Algorithm,” Human-centric Computing and Information Sciences, vol. 11, article no. 13, 2021.

  • Received: 19 March 2020
  • Accepted: 22 December 2020
  • Published: 31 March 2021