Apriori Algorithm for Market Basket Analysis

Michael Behan
Project 1
CS 634
Part1: Program Input and Output
Part2: Instruction Manual
Part3: Source Code and Documentation
Part4: Video of Run1-Run5 on the datasets
Introduction:
This project is on association rule mining, a popular and well-researched method for
discovering interesting relations between variables in databases. It is most often used in market basket
analysis to retrieve strong association rules. An association rule is strong if it meets user-defined support
and confidence thresholds. The algorithm used in this project is Apriori. It works by identifying the
frequent individual items in the transactional database and extending them to larger and larger itemsets as
long as those itemsets appear often enough in the database. It relies on the principle that if an itemset is
not frequent, then none of its supersets can be frequent. Rules are written in the form
"milk" => "eggs" (20%, 50%), 20% being the support and 50% the confidence of the rule. The project is
written in C# using Visual Studio 2012 WinForms in a Windows 7 environment.
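As a hedged illustration of the two measures, the sketch below computes the support and confidence of "milk" => "eggs" on five made-up transactions. The class name RuleMeasures, its methods, and the sample data are assumptions introduced for this example and are not part of the project code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the project: computes rule measures as percentages.
static class RuleMeasures
{
    // Made-up sample data: five transactions.
    public static List<string[]> SampleTransactions()
    {
        return new List<string[]>
        {
            new[] { "milk", "eggs", "bread" },
            new[] { "milk", "eggs" },
            new[] { "milk", "bread" },
            new[] { "eggs", "butter" },
            new[] { "bread", "butter" }
        };
    }

    // Support: percentage of transactions that contain every item in 'items'.
    public static double Support(List<string[]> db, string[] items)
    {
        int hits = db.Count(t => items.All(i => t.Contains(i)));
        return 100.0 * hits / db.Count;
    }

    // Confidence of lhs => rhs: support of lhs and rhs together, divided by support of lhs.
    public static double Confidence(List<string[]> db, string[] lhs, string[] rhs)
    {
        double both = Support(db, lhs.Union(rhs).ToArray());
        return 100.0 * both / Support(db, lhs);
    }
}
```

With this sample data, {milk, eggs} appears in 2 of 5 transactions (40% support) and {milk} in 3 of 5 (60%), so the confidence of "milk" => "eggs" is 40 / 60, about 67%. The rule would count as strong under thresholds of, say, 30% support and 50% confidence.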
Part1:
This section shows the input and output of all 5 runs of the program.
Data: the data in this project consists of 10 strings representing the 10 items (listed below):
"milk", "eggs", "butter", "meat", "bread",
"chips", "coffee", "soda", "fruit", "vegetable"
Each transaction is represented by a line in a text file or a tuple in MySQL (as tinytext).
Michael Behan | NJIT CS 634
Run 1
This run is on set1.txt. Parameters: support = 30, confidence = 50.
Input:
Output:
Run 2
This run is on set2.txt. Parameters: support = 40, confidence = 70.
Input:
Output:
Run 3
This run is on set3.txt. Parameters: support = 20, confidence = 60.
Input:
Output:
Run 4
This run is on set4.txt. Parameters: support = 40, confidence = 60.
Input:
Output:
Run 5
This run is on a MySQL database containing 20 transactions. Parameters: support = 40, confidence = 50.
Input:
Output:
Part2: Instruction Manual
In this section I will go over the UI and the instructions for Run 1 and Run 5. Runs 2-4 were all
done the same way as Run 1.
Run1 Step1: Click Open File and choose a file.
Step2: textBox3 and the chart are now populated. Enter support and confidence in the corresponding textboxes.
Step3: Click Run Apriori; the frequent itemsets and association rules are generated in textBox3.
Run5 Step1: Click Use Database. listBox1 and chart1 are populated. Now enter support and confidence in
the corresponding textboxes.
Step2: Now click Run Apriori; the frequent itemsets and association rules are generated in textBox3.
Part3: Source Code and Documentation
There are 5 classes in this project:
MyApriori.cs
AssociationRules.cs
Subsets.cs
ItemsetDB.cs
Itemset.cs
The main form is Form1.cs.
All Source Code is listed below followed by a description of the major functions.
Itemset.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace DataMiningP1
{
    class Itemset : List<string>
    {
        public double Support { get; set; }
        // True when every item of the given itemset is also in this itemset.
        public bool IsIn(Itemset itemset)
        {
            return (this.Intersect(itemset).Count() == itemset.Count);
        }
        // Returns the items of this itemset that are not in the given itemset.
        public Itemset RemoveItemset(Itemset itemset)
        {
            Itemset removedset = new Itemset();
            removedset.AddRange(from i in this
                                where !itemset.Contains(i)
                                select i);
            return (removedset);
        }
        public string ToOutputString()
        {
            return ("{" + string.Join(", ", this.ToArray()) + "}" + " | Support = " +
                Math.Round(this.Support) + "%");
        }
        public string ToItemsetString()
        {
            return ("{" + string.Join(", ", this.ToArray()) + "}");
        }
    }
}
Description:
Itemset.cs is a List<> of strings that represents a transaction (a set of items).
double Support holds the support when printing out frequent itemsets.
bool IsIn(Itemset) returns true when every item of the given itemset is contained in this itemset.
Itemset RemoveItemset(Itemset) returns a new itemset that holds the items of this itemset which are
not in the given itemset (it is used when generating association rules to obtain I2 from a frequent
itemset and I1).
ToOutputString() is used to print out frequent itemsets together with their support.
ToItemsetString() is used to print out each itemset.
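The containment test and the set difference described above can be tried in isolation. The following is a minimal sketch on plain List<string> values; SubsetCheck, ContainsAll and Without are hypothetical names chosen for this example, not members of the project's Itemset class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-alone versions of the two set operations.
static class SubsetCheck
{
    // Mirrors the Intersect-based test: true when every item of 'inner' occurs in 'outer'.
    // Intersect yields distinct items, so 'inner' is assumed to contain no duplicates.
    public static bool ContainsAll(List<string> outer, List<string> inner)
    {
        return outer.Intersect(inner).Count() == inner.Count;
    }

    // Items of 'outer' that are not in 'removed' (the same idea as RemoveItemset).
    public static List<string> Without(List<string> outer, List<string> removed)
    {
        return outer.Where(i => !removed.Contains(i)).ToList();
    }
}
```

Note that an empty 'inner' list is contained in every list, because both counts are zero. This is why an empty itemset would have 100% support under this test.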
ItemsetDB.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace DataMiningP1
{
    class ItemsetDB : List<Itemset>
    {
        // Collects every distinct item that appears in any transaction.
        public Itemset UniqueItems()
        {
            Itemset unique = new Itemset();
            foreach (Itemset itemset in this)
            {
                unique.AddRange(from i in itemset
                                where !unique.Contains(i)
                                select i);
            }
            return (unique);
        }
        // Support as a percentage: transactions containing the itemset / all transactions.
        public double FindSupportOfItem(Itemset itemset)
        {
            int TotalItemCount = (from j in this
                                  where j.IsIn(itemset)
                                  select j).Count();
            double Support = ((double)TotalItemCount / (double)this.Count) * 100.00;
            return (Support);
        }
        // Number of transactions that contain the itemset.
        public int ItemCount(Itemset itemset)
        {
            int TotalItemCount = (from k in this
                                  where k.IsIn(itemset)
                                  select k).Count();
            return (TotalItemCount);
        }
        public string ToItemsetDBString()
        {
            return (string.Join("\r\n", (from itemset in this
                                         select itemset.ToItemsetString()).ToArray()));
        }
    }
}
Description:
ItemsetDB.cs is a List<> of Itemsets (instances of the Itemset class). It is used to
represent a transactional database.
Itemset UniqueItems() returns a new Itemset of the unique items in the database.
double FindSupportOfItem(Itemset) returns the support of an itemset. It uses the IsIn
function to count the transactions that contain the itemset and divides that count by the total
number of transactions.
int ItemCount(Itemset) returns the count of an itemset (used to populate the chart).
string ToItemsetDBString() is used to print out the transactional database.
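The project obtains per-item counts by calling ItemCount once for each unique item. As an alternative that is not used in the project, the same counts can be gathered in one pass over the database with a dictionary. The sketch below assumes transactions are plain List<string> values; ItemCounts and CountItems are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical single-pass counter, shown as an alternative to repeated ItemCount calls.
static class ItemCounts
{
    // Counts, for each distinct item, how many transactions contain it.
    public static Dictionary<string, int> CountItems(List<List<string>> db)
    {
        var counts = new Dictionary<string, int>();
        foreach (List<string> transaction in db)
        {
            // Distinct() makes an item repeated within one transaction count only once.
            foreach (string item in transaction.Distinct())
            {
                int current;
                counts.TryGetValue(item, out current); // current stays 0 if the item is new
                counts[item] = current + 1;
            }
        }
        return counts;
    }
}
```

Dividing a count by the number of transactions and multiplying by 100 gives the same support percentage that FindSupportOfItem returns for a single item.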
Subsets.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace DataMiningP1
{
    class Subsets
    {
        // Returns the subsets of itemset that have exactly n items (all subsets when n == 0).
        public static ItemsetDB FindAllSubsets(Itemset itemset, int n)
        {
            ItemsetDB Allsubsets = new ItemsetDB();
            int SubsetsCount = (int)Math.Pow(2, itemset.Count);
            for (int i = 0; i < SubsetsCount; i++)
            {
                if (n == 0 || Ones(i, itemset.Count) == n)
                {
                    string Binary = ToBin(i, itemset.Count);
                    Itemset Subset = new Itemset();
                    for (int j = 0; j < Binary.Length; j++)
                    {
                        if (Binary[j] == '1')
                        {
                            Subset.Add(itemset[j]);
                        }
                    }
                    Allsubsets.Add(Subset);
                }
            }
            return (Allsubsets);
        }
        // Returns 1 when bit p of i is set, otherwise 0.
        public static int FindBit(int i, int p)
        {
            int bit = i & (int)Math.Pow(2, p);
            return (bit > 0 ? 1 : 0);
        }
        // Builds the l-digit binary string of i, most significant bit first.
        public static string ToBin(int i, int l)
        {
            string Binary = string.Empty;
            for (int p = 0; p < l; p++)
            {
                Binary = FindBit(i, p) + Binary;
            }
            return (Binary);
        }
        // Counts the 1 digits in the l-digit binary string of i.
        public static int Ones(int i, int l)
        {
            string Binary = ToBin(i, l);
            return (from char ch in Binary.ToCharArray()
                    where ch == '1'
                    select ch).Count();
        }
    }
}
Description:
The Subsets class (Subsets.cs) is used to find all subsets iteratively. A set of n items has
2^n subsets (2^n - 1 of them non-empty). The binary representation
of a number is used to represent a subset. For example, for the set S = {1,2,3}: {1} = 001, {2} = 010,
{1,2} = 011, {3} = 100, {1,3} = 101, {2,3} = 110 and {1,2,3} = 111, so all possible subsets can be
represented by the corresponding binary numbers.
ItemsetDB FindAllSubsets(Itemset,int) This function generates all subsets with n items (or all
subsets of any size when n is 0) by adding an item to the subset wherever there is a 1 in the binary string.
int FindBit(int,int) This function returns 1 or 0 depending on whether the bit at the given
position of the number is set.
string ToBin(int,int) This function is used with the FindBit() function to generate a
binary string.
int Ones(int,int) This function counts the number of 1's in a binary string and returns
that count.
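The binary-counter idea described above can also be written with integer bit operations instead of a binary string. This is a minimal sketch of that variant, not the project's implementation; BitSubsets and OfSize are hypothetical names, and the input is assumed to hold at most 30 items so that the counter fits in an int.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical bitmask variant of the subset enumeration.
static class BitSubsets
{
    // Returns every subset of 'items' with exactly 'size' elements (all subsets when size == 0).
    // Bit p of the counter 'mask' decides whether items[p] is included.
    public static List<List<string>> OfSize(List<string> items, int size)
    {
        var result = new List<List<string>>();
        int total = 1 << items.Count; // 2^n bit patterns; assumes at most 30 items
        for (int mask = 0; mask < total; mask++)
        {
            var subset = new List<string>();
            for (int p = 0; p < items.Count; p++)
            {
                if ((mask & (1 << p)) != 0)
                {
                    subset.Add(items[p]);
                }
            }
            if (size == 0 || subset.Count == size)
            {
                result.Add(subset);
            }
        }
        return result;
    }
}
```

For three items, a size of 2 yields the three pairs selected by the masks 011, 101 and 110, and a size of 0 yields all 8 subsets including the empty one.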
AssociationRules.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace DataMiningP1
{
    class AssociationRules
    {
        // Left-hand side (I1) and right-hand side (I2) of the rule I1 => I2.
        public Itemset I1 { get; set; }
        public Itemset I2 { get; set; }
        public double Support { get; set; }
        public double Confidence { get; set; }
        public AssociationRules()
        {
            I1 = new Itemset();
            I2 = new Itemset();
        }
        public string ToAssociationRulesString()
        {
            return (I1.ToItemsetString() + " => " + I2.ToItemsetString() + " | Support = "
                + Math.Round(Support) + "%, Confidence = " + Math.Round(Confidence) + "%");
        }
    }
}
Description:
AssociationRules.cs This class is used to represent a generated association rule.
double Support holds the support of the association rule.
double Confidence holds the confidence of the association rule.
Itemset I1, I2 hold the two itemsets of the rule I1 => I2. New instances are
initialized in the constructor AssociationRules().
ToAssociationRulesString() is used to print out association rules.
MyApriori.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace DataMiningP1
{
    class MyApriori
    {
        public static ItemsetDB Apriori(ItemsetDB DB, double SupportPercent)
        {
            Itemset Aitem = DB.UniqueItems();
            ItemsetDB AllItem = new ItemsetDB(); // every frequent itemset found so far
            ItemsetDB IItem = new ItemsetDB();   // frequent itemsets of the current pass
            ItemsetDB CItem = new ItemsetDB();   // candidate itemsets of the current pass
            // Start with the 1-item candidates.
            foreach (string item in Aitem)
            {
                CItem.Add(new Itemset() { item });
            }
            int n = 2;
            while (CItem.Count != 0)
            {
                IItem.Clear();
                foreach (Itemset itemset in CItem)
                {
                    itemset.Support = DB.FindSupportOfItem(itemset);
                    if (itemset.Support >= SupportPercent)
                    {
                        IItem.Add(itemset);
                        AllItem.Add(itemset);
                    }
                }
                // Next candidates: all n-item subsets of the items that are still frequent.
                CItem.Clear();
                CItem.AddRange(Subsets.FindAllSubsets(IItem.UniqueItems(), n));
                n += 1;
            }
            return (AllItem);
        }
        public static List<AssociationRules> DataMining(ItemsetDB DB, ItemsetDB Allitem,
            double ConfidencePercent)
        {
            List<AssociationRules> GeneratedRules = new List<AssociationRules>();
            foreach (Itemset itemset in Allitem)
            {
                // n == 0 requests subsets of every size, including the empty and the full set.
                ItemsetDB rulesets = Subsets.FindAllSubsets(itemset, 0);
                foreach (Itemset set in rulesets)
                {
                    double confidence = (DB.FindSupportOfItem(itemset) /
                        DB.FindSupportOfItem(set)) * 100.0;
                    if (confidence >= ConfidencePercent)
                    {
                        AssociationRules Nrule = new AssociationRules();
                        Nrule.I1.AddRange(set);
                        Nrule.I2.AddRange(itemset.RemoveItemset(set));
                        Nrule.Support = DB.FindSupportOfItem(itemset);
                        Nrule.Confidence = confidence;
                        // Skip the splits where one side of the rule is empty.
                        if (Nrule.I1.Count > 0 && Nrule.I2.Count > 0)
                        {
                            GeneratedRules.Add(Nrule);
                        }
                    }
                }
            }
            return (GeneratedRules);
        }
    }
}
Description: MyApriori.cs This class contains two major functions. The first, Apriori, implements the
Apriori algorithm used to find frequent itemsets. The second, DataMining, is used to
generate association rules that meet the user-defined confidence threshold (the itemsets it
receives already meet the support threshold).
For a rule I1 => I2, support = the percentage of transactions that contain I1 union I2, and
confidence = the support of I1 union I2 divided by the support of I1.
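Written out as formulas, the two measures as the code computes them are given below. The symbols D (the transactional database), T (a transaction in it) and X (any itemset) are notation introduced here and do not appear in the code.

```latex
\[
\mathrm{support}(X) = \frac{\lvert \{\, T \in D : X \subseteq T \,\} \rvert}{\lvert D \rvert} \times 100
\]
\[
\mathrm{support}(I_1 \Rightarrow I_2) = \mathrm{support}(I_1 \cup I_2),
\qquad
\mathrm{confidence}(I_1 \Rightarrow I_2) = \frac{\mathrm{support}(I_1 \cup I_2)}{\mathrm{support}(I_1)} \times 100
\]
```

For example, if I1 union I2 has 40% support and I1 has 60% support, the confidence is 40 / 60 x 100, which is about 67%.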
ItemsetDB Apriori(ItemsetDB,double) The function starts by initializing an Itemset Aitem
with the unique items of the database (by calling the UniqueItems function in the ItemsetDB class). It
then creates 3 new ItemsetDB instances: AllItem (this holds all frequent
itemsets and will be passed to the DataMining function), IItem (this holds the frequent itemsets
of the current iteration) and CItem (this holds the candidate itemsets of the current iteration). First
a 1-item candidate set is generated from Aitem. Next, IItem is cleared (it is empty on the first pass
but will not be on later iterations). Next, the support of each itemset in CItem is checked; if
it is at or above the user-defined support threshold, the itemset is added to IItem and AllItem.
Next, CItem is cleared. Then the 2-item candidate sets are added to CItem by calling the
FindAllSubsets() function from the Subsets class. Its first parameter is the set of unique items in
IItem (UniqueItems() from the ItemsetDB class is called on IItem). The
second parameter is n, the subset size, which is initialized to 2 and incremented after
each call. This process continues while CItem.Count != 0, that is, until no candidates
remain, and AllItem is then returned from the function.
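The function above builds its candidates from all n-item subsets of the items that are still frequent. A common refinement, which the project code does not use, is to discard a candidate before counting it whenever one of its (n-1)-item subsets was not frequent in the previous pass. The sketch below illustrates only that check, under the assumption that frequent itemsets are stored as sorted, comma-joined keys; CandidatePruning, Key and AllSubsetsFrequent are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pruning check based on the Apriori principle.
static class CandidatePruning
{
    // Canonical key for an itemset: items sorted (ordinal) and joined with commas.
    public static string Key(IEnumerable<string> items)
    {
        return string.Join(",", items.OrderBy(i => i, StringComparer.Ordinal));
    }

    // True when every subset obtained by dropping one item is in 'frequentKeys'.
    public static bool AllSubsetsFrequent(List<string> candidate, HashSet<string> frequentKeys)
    {
        for (int skip = 0; skip < candidate.Count; skip++)
        {
            // Subset of the candidate without the item at position 'skip'.
            var subset = candidate.Where((item, index) => index != skip);
            if (!frequentKeys.Contains(Key(subset)))
            {
                return false;
            }
        }
        return true;
    }
}
```

A candidate that fails this check cannot be frequent, so its support never has to be counted against the database.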
List<AssociationRules> DataMining(ItemsetDB, ItemsetDB, double) This function starts by
initializing a List<> of AssociationRules. It then iterates through Allitem, which
was passed to the function, finding every subset of each itemset by using the FindAllSubsets
function and checking whether the confidence of the rule formed from that subset is at or above the
user-defined threshold. It adds the rule if the requirement is met and both I1 and I2 are
non-empty. Finally, the function returns the list of AssociationRules after iterating over
Allitem.
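The rule generation step can be sketched for a single frequent itemset as follows. This is an illustrative variant, not the project's code: it takes the support lookup as a caller-supplied function and returns the rules as strings. RuleSketch and Rules are hypothetical names, and the itemset is assumed to hold at most 30 items.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-alone rule generator for one frequent itemset.
static class RuleSketch
{
    // Splits 'itemset' into every pair of non-empty sides (left => right) and keeps the
    // pairs whose confidence reaches 'minConfidence' (a percentage).
    // 'support' returns the support percentage of an itemset.
    public static List<string> Rules(List<string> itemset,
        Func<List<string>, double> support, double minConfidence)
    {
        var rules = new List<string>();
        double whole = support(itemset);
        int total = 1 << itemset.Count;
        // Masks 1 .. total-2 skip the empty set and the full set.
        for (int mask = 1; mask < total - 1; mask++)
        {
            var left = itemset.Where((item, p) => (mask & (1 << p)) != 0).ToList();
            var right = itemset.Where((item, p) => (mask & (1 << p)) == 0).ToList();
            double confidence = whole / support(left) * 100.0;
            if (confidence >= minConfidence)
            {
                rules.Add("{" + string.Join(", ", left) + "} => {"
                    + string.Join(", ", right) + "}");
            }
        }
        return rules;
    }
}
```

Starting the mask at 1 and stopping before the all-ones mask plays the same role as the I1.Count > 0 && I2.Count > 0 check in DataMining.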
Form1.cs
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.IO;
using MySql.Data.MySqlClient;
using System.Windows.Forms.DataVisualization.Charting;
namespace DataMiningP1
{
    public partial class Form1 : Form
    {
        public string Fname;
        private Itemset Kmart_items;
        private ItemsetDB Kmart_db;
        public Form1()
        {
            InitializeComponent();
        }
        // Loads the transactions from the MySQL database into listBox1, Kmart_db and chart1.
        public void fill_listbox()
        {
            textBox3.Text = string.Empty;
            string KmartConnection =
                "datasource=localhost;port=3306;username=root;password=;";
            string Query = "select * from kmarttran.kmartrecords ;";
            MySqlConnection KmartConDB = new MySqlConnection(KmartConnection);
            MySqlCommand KmartCommandDB = new MySqlCommand(Query, KmartConDB);
            MySqlDataReader KmartReader;
            try
            {
                KmartConDB.Open();
                KmartReader = KmartCommandDB.ExecuteReader();
                Kmart_db = new ItemsetDB();
                while (KmartReader.Read())
                {
                    Kmart_items = new Itemset();
                    char[] delimiterChars = { ' ', ',' };
                    string Transactions = KmartReader.GetString("kmarttranactions");
                    listBox1.Items.Add(Transactions);
                    Kmart_items.AddRange(Transactions.Split(delimiterChars,
                        StringSplitOptions.RemoveEmptyEntries));
                    Kmart_db.Add(Kmart_items);
                }
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
                // The database could not be read, so there is nothing to chart.
                return;
            }
            finally
            {
                KmartConDB.Close();
            }
            Itemset k = Kmart_db.UniqueItems();
            chart1.Series.Clear();
            chart1.Series.Add("Items");
            chart1.Series["Items"].ChartType = SeriesChartType.Column;
            foreach (string item in k)
            {
                Itemset temp = new Itemset();
                temp.Add(item);
                chart1.Series["Items"].Points.Add(Kmart_db.ItemCount(temp));
                chart1.Series["Items"].Points[k.IndexOf(item)].AxisLabel = item;
            }
        }
        private void Form1_Load(object sender, EventArgs e)
        {
            string[] kmartItem = { "milk", "eggs", "butter", "meat", "bread", "chips",
                "coffee", "soda", "fruit", "vegetable" };
            foreach (string kitem in kmartItem)
            {
                comboBox1.Items.Add(kitem);
            }
            comboBox1.SelectedIndex = 0;
        }
        private void AddaLine(string line)
        {
            textBox3.Text += line + "\r\n";
        }
        // Open File: reads a text file with one transaction per line.
        private void button2_Click(object sender, EventArgs e)
        {
            OpenFileDialog dlg = new OpenFileDialog();
            // Show the dialog once and stop if the user cancels, so no unset file name is read.
            if (dlg.ShowDialog() != DialogResult.OK)
            {
                return;
            }
            Fname = dlg.FileName;
            char[] delimiterChars = { ' ', ',' };
            Kmart_db = new ItemsetDB();
            string line;
            using (System.IO.StreamReader file = new System.IO.StreamReader(Fname))
            {
                while ((line = file.ReadLine()) != null)
                {
                    Kmart_items = new Itemset();
                    Kmart_items.AddRange(line.Split(delimiterChars,
                        StringSplitOptions.RemoveEmptyEntries));
                    Kmart_db.Add(Kmart_items);
                }
            }
            textBox3.Text = Kmart_db.ToItemsetDBString();
            Itemset k = Kmart_db.UniqueItems();
            chart1.Series.Clear();
            chart1.Series.Add("Items");
            chart1.Series["Items"].ChartType = SeriesChartType.Column;
            foreach (string item in k)
            {
                Itemset temp = new Itemset();
                temp.Add(item);
                chart1.Series["Items"].Points.Add(Kmart_db.ItemCount(temp));
                chart1.Series["Items"].Points[k.IndexOf(item)].AxisLabel = item;
            }
        }
        // Run Apriori: prints the frequent itemsets, then the association rules.
        private void button1_Click(object sender, EventArgs e)
        {
            textBox3.Text = string.Empty;
            double userSupport = double.Parse(textBox1.Text);
            ItemsetDB FrequentItems = MyApriori.Apriori(Kmart_db, userSupport);
            AddaLine(FrequentItems.Count + " Frequent Itemsets");
            foreach (Itemset itemset in FrequentItems)
            {
                AddaLine(itemset.ToOutputString());
            }
            AddaLine(string.Empty);
            double userConfidence = double.Parse(textBox2.Text);
            List<AssociationRules> allAssociationRules = MyApriori.DataMining(Kmart_db,
                FrequentItems, userConfidence);
            AddaLine(allAssociationRules.Count + " Association Rules");
            foreach (AssociationRules rules in allAssociationRules)
            {
                AddaLine(rules.ToAssociationRulesString());
            }
        }
        // Use Database: loads the transactions from MySQL.
        private void button4_Click(object sender, EventArgs e)
        {
            fill_listbox();
        }
    }
}
UI Names in code:
Description:
button2: This function opens a text file and reads it in line by line.
It creates a new Itemset for every line (by splitting on ' ' and ',' and removing empty entries) and
then adds it to Kmart_db (my transactional database). It populates textBox3 with the
transactional database and chart1 with the count of each individual item.
button4: This function calls the fill_listbox() function, which connects to a local MySQL
database. It creates a new Itemset for each transaction record in the database, then adds it to
Kmart_db the same way as button2 (but populates listBox1 with the database instead of
textBox3).
button1: This function prints out the frequent itemsets (by calling the Apriori function
in the MyApriori class) and the association rules (by calling the DataMining function in the
MyApriori class, whose result is stored in allAssociationRules, a List<AssociationRules>). It calls the
AddaLine() function for each line of output.
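The line-splitting step shared by button2 and fill_listbox() can be tried on its own. The sketch below is a minimal stand-alone version; TransactionParser and ParseLine are hypothetical names, and the sample line in the comment is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that isolates the line-splitting step.
static class TransactionParser
{
    // Splits one line such as "milk, eggs,bread" into its items,
    // treating spaces and commas as separators and dropping empty pieces.
    public static List<string> ParseLine(string line)
    {
        char[] delimiters = { ' ', ',' };
        return line.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).ToList();
    }
}
```

A blank line yields an empty list. In the project code such a line would still be added to Kmart_db as an empty transaction and would count toward the total used for support, so input files should not contain blank lines.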
Additional Notes on Form1.cs
Form1_Load adds the items I will be using from an array into comboBox1.
chart1 is a column graph of the count of each individual item. It allows the user to
visualize the support of an individual item easily by dividing the value by the total number of
transactions (20 in the MySQL database used for Run 5).
Part4: Video
A video of Run 1 through Run 5 should be below; if not, it is on YouTube at
http://www.youtube.com/watch?v=xJypZW4G0ig&feature=youtu.be