Use of WEKA software, CBR non-contingency planning
Recently, I have been studying CBR non-contingency planning, which requires building data sets and using the WEKA software. There are few related articles online; fortunately, I came across this one, which solved many of my problems: Basic Use of WEKA – RongT – cnblogs.com
Basic use of WEKA
Directory:
1. Introduction
2. Interface overview
3. Data format
4. Data preparation
5. Association rules
6. Classification and regression
7. Cluster analysis
8. Weka learning resources
9. Weka secondary development
10. Weka source code import
1. Introduction
WEKA is short for the Waikato Environment for Knowledge Analysis; its source code can be obtained from www.cs.waikato.ac.nz/ml/weka. It is open-source machine learning and data mining software written in Java.
As an open data mining platform, WEKA brings together a large number of machine learning algorithms for data mining tasks, including data preprocessing, classification, regression, clustering, association rules, and visualization, in an interactive interface. If you want to implement your own data mining algorithm, take a look at WEKA's interface documentation: it is not difficult to integrate your own algorithms into WEKA, or even to build your own visualization tools on top of its methods.
2. Interface overview:
Run weka GUI:
[Figure: the WEKA GUI Chooser (1.png)]
Click the Explorer button to open the Explorer interface:
[Figure: the Explorer Preprocess panel, with areas 1-8 marked (2.png)]
The tabs in area 1 switch between panels for the different mining tasks. On the Preprocess panel, area 2 holds a few common buttons, including opening, saving, and editing data; this is where you can convert a CSV file to ARFF.
In area 3 you select a Filter, which can filter or transform the data; data preprocessing is done mainly through it. Area 4 shows basic information about the data set. Area 5 lists all attributes of the data set. Attributes can be deleted by checking them and clicking "Remove"; a deletion can be reverted with the "Undo" button in area 2. The row of buttons above area 5 is for quick selection. If you select an attribute in area 5, area 6 shows a summary of it. Note that the summary differs for numeric and categorical attributes; the figure shows a summary of the numeric attribute "income". Area 7 is the histogram of the attribute selected in area 5. If the last attribute of the data set (which, as noted below, is the default target variable for classification or regression tasks) is categorical ("pep" here happens to be), each rectangle in the histogram is divided into color-coded segments in proportion to that variable. To change the segmentation criterion, select a different categorical attribute from the drop-down box above area 7; selecting "No Class" or a numeric attribute turns the histogram black and white. Area 8 is the status bar, where you can view the Log to check for errors. If the Weka bird on the right is moving, WEKA is busy with a mining task. You can also trigger Java garbage collection by right-clicking the status bar.
3. Data format
[Figure 1: the "weather" data set as a two-dimensional table (3.png)]
A row in the table is called an instance (Instance) and corresponds to a sample in statistics or a record in a database. A column is called an attribute (Attribute) and corresponds to a variable in statistics or a field in a database. Such a table, or data set, is what WEKA views as a relation (Relation) between attributes. In Figure 1 there are 14 instances and 5 attributes, and the relation name is "weather".
WEKA stores data in the Attribute-Relation File Format (ARFF), an ASCII text file. The two-dimensional table shown in Figure 1 is stored in the ARFF file below. This is the weather.arff file that ships with WEKA; it can be found in the "data" subdirectory of the WEKA installation directory.
Note that when you open this file in Windows Notepad, the line breaks may not display properly because of inconsistent newline conventions. You are advised to view ARFF files in a text editor such as UltraEdit.
[Figure: contents of the weather.arff file (4.png)]
Contents:
Lines are the basic unit of an ARFF file, so you cannot break lines arbitrarily in such files. Blank lines (or lines consisting only of spaces) are ignored. Lines starting with "%" are comments, which WEKA ignores; if your "weather.arff" has a few more or fewer "%" lines, it does not matter. With the comments removed, an ARFF file divides into two parts. The first part gives the header information, consisting of the relation declaration and the attribute declarations. The second part gives the data information, that is, the instances in the data set; it starts at the "@data" tag, with the data following it.
Relation declaration. The relation name is defined in the first valid line of the ARFF file, in the form "@relation <relation-name>", where <relation-name> is a string. If the string contains spaces, it must be quoted (with single or double quotation marks, using English punctuation).
Attribute declarations. Attribute declarations take the form of a list of statements beginning with "@attribute". Each attribute in the data set has a corresponding "@attribute" statement that defines its name and data type. The order in which these statements appear is important. First, it fixes the position of each attribute in the data section: for example, the "humidity" attribute is declared third, which means that the third comma-separated column of the data section (85, 90, 86, 96, ...) holds the corresponding "humidity" values. Second, the last declared attribute is called the class attribute, the default target variable in classification or regression tasks. An attribute declaration has the format "@attribute <attribute-name> <datatype>", where <attribute-name> is a string that must start with a letter; as with relation names, a string containing spaces must be quoted. WEKA supports four data types:
- numeric: numeric values
- <nominal-specification>: nominal (categorical) values
- string: arbitrary text
- date [<date-format>]: dates and times
Each is described below. Two other keywords, "integer" and "real", may also be used, but WEKA treats both as "numeric". Note that the keywords "integer", "real", "numeric", "date", and "string" are case sensitive, while "relation", "attribute", and "data" are not.
Numeric attributes. Numeric attributes can be integers or real numbers, but WEKA treats both as real numbers.
Nominal (categorical) attributes. A nominal attribute consists of a list of possible category names enclosed in curly braces: {<nominal-name1>, <nominal-name2>, ...}. The value of this attribute in the data set can only be one of these categories. For example, the following declaration states that the "outlook" attribute has three possible values, and the outlook value of every instance in the data set must be one of them: @attribute outlook {sunny, overcast, rainy}. If a category name contains a space, it must still be quoted.
String attributes. String attributes can contain arbitrary text, which makes them very useful in text mining. Example: @attribute LCC string
Date and time attributes. Dates and times are represented by the "date" type, in the format "@attribute <name> date [<date-format>]", where <name> is the attribute name and <date-format> is a string specifying how to parse and display the date or time. The default format string is the ISO-8601 combined format "yyyy-MM-dd'T'HH:mm:ss". Strings representing dates in the data section must conform to the format specified in the declaration (examples below).
Data information. The "@data" tag occupies a line by itself; everything after it is the data for the individual instances.
Each instance occupies one row, with attribute values separated by commas (,). If an attribute's value is missing, it is represented by a question mark "?", which cannot be omitted. For example:
@data
sunny,85,85,FALSE,no
?,78,90,?,yes
The values of string and nominal attributes are case sensitive, and a value containing spaces must be quoted. For example:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
The value of a date attribute must match the format given in the attribute declaration. For example:
@relation Timestamps
@attribute timestamp DATE "yyyy-MM-dd HH:mm:ss"
@data
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
Sparse data. Data sets sometimes contain a large number of 0 values (for example in market basket analysis), and it saves space to store such data in sparse format. The sparse format only changes how instances are written in the data section; no other part of the ARFF file needs to be modified. Compare the following data:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
with its sparse version:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
In sparse format, each instance is enclosed in curly braces, and each non-zero attribute value is written as <index> <space> <value>, where <index> is the attribute's position, counting from 0, and <value> is the attribute value. Values are still separated by commas, and within an instance they must be written in attribute order: {1 X, 3 Y, 4 "class A"} is valid, {3 Y, 1 X, 4 "class A"} is not. Note that attribute values not listed in sparse format are not missing values but 0 values; missing values must still be written explicitly as question marks.
WEKA 3.5 added a new attribute type called relational, which makes it possible to handle multiple dimensions in the style of a relational database. This type is not widely used yet, so it is not covered here.
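As a wrap-up of this section, here is a minimal Java sketch of reading an ARFF file through the WEKA API and printing the header information described above. It assumes weka.jar is on the classpath and that the path points at the weather.arff shipped in WEKA's "data" subdirectory.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ArffInspect {
    public static void main(String[] args) throws Exception {
        // Load the weather data set shipped with WEKA (path is an assumption).
        Instances data = DataSource.read("data/weather.arff");
        System.out.println("Relation:   " + data.relationName());  // "weather"
        System.out.println("Instances:  " + data.numInstances());  // 14
        System.out.println("Attributes: " + data.numAttributes()); // 5
        for (int i = 0; i < data.numAttributes(); i++) {
            // Attribute.toString() prints the @attribute declaration line.
            System.out.println("  " + data.attribute(i));
        }
    }
}
```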
— From www.cs.waikato.ac.nz/~ml/weka/ar… and weka.sourceforge.net/wekadoc/ind…
4. Data preparation
The first problem in data mining with WEKA is that our data is usually not in ARFF format. Fortunately, WEKA also supports CSV files, which many other programs can produce, and in addition WEKA can access databases through JDBC. In this section we first use Excel and Matlab as examples of how to obtain a CSV file. Then we show how a CSV file can be converted to an ARFF file, which is, after all, the file format WEKA supports best. Even with an ARFF file in hand, we still have some preprocessing to do before we can start the mining task.
.* -> .csv Let us start with an example of a CSV file (bank-data.csv). If you open it with UltraEdit, you can see that this format is simply a comma-separated text file storing a two-dimensional table.
An Excel XLS file can hold multiple two-dimensional tables in separate sheets, so each sheet must be stored as its own CSV file. Open the XLS file, switch to the worksheet to be converted, save as CSV type, then click "OK" and "Yes" to dismiss the prompts.
In Matlab a two-dimensional table is a matrix, and the command csvwrite('filename', matrixName) saves a matrix in CSV format. Note that CSV files produced by Matlab usually lack attribute names (Excel files may too). WEKA must read the attribute names from the first line of the CSV file; otherwise it treats the attribute values in the first line as variable names. So we open the CSV file produced by Matlab in UltraEdit and manually add a line of attribute names; the number of names must match the number of data columns, separated by commas.
.csv -> .arff The quickest way to convert CSV to ARFF is to use WEKA's command-line tools. Run WEKA's main program; after the GUI appears, click the corresponding button to enter the "Simple CLI" module, which provides command-line functionality. At the bottom of the new window, type

java weka.core.converters.CSVLoader filename.csv > filename.arff

to complete the conversion. WEKA 3.5 also provides an "Arff Viewer" module that can open a CSV file for browsing and save it as an ARFF file. Alternatively, go into the "Explorer" module, open the CSV file with the button at the top, and save it as an ARFF file.
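The same conversion can also be done through the WEKA API. Below is a minimal sketch using the CSVLoader and ArffSaver converter classes; the file names are the ones used in this section, so adjust the paths to your own layout.

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class Csv2Arff {
    public static void main(String[] args) throws Exception {
        // Read the CSV file; its first line must hold the attribute names.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank-data.csv"));
        Instances data = loader.getDataSet();

        // Write the same instances back out as ARFF.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("bank-data.arff"));
        saver.writeBatch();
    }
}
```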
The attributes of the preprocessed bank-data set have the following meanings:
id: a unique identification number
age: age of customer in years (numeric)
sex: MALE / FEMALE
region: inner_city / rural / suburban / town
income: income of customer (numeric)
married: is the customer married (YES/NO)
children: number of children (numeric)
car: does the customer own a car (YES/NO)
save_act: does the customer have a savings account (YES/NO)
current_act: does the customer have a current account (YES/NO)
mortgage: does the customer have a mortgage (YES/NO)
pep: did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
Information like "id" is usually useless for data mining tasks, so we remove it: in area 5, check the attribute "id" and click "Remove". Save the new data set and open the ARFF file in UltraEdit; you will find that in the attribute declaration section WEKA has already chosen an appropriate type for each attribute.
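For reference, this "remove the id attribute" step can also be scripted against the WEKA API. A minimal sketch with the Remove filter, assuming "id" is the first attribute, as in bank-data:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropId {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data.arff");
        Remove remove = new Remove();
        remove.setAttributeIndices("1"); // "id" is the first attribute; indices are 1-based
        remove.setInputFormat(data);     // must be called before filtering
        Instances noId = Filter.useFilter(data, remove);
        System.out.println(noId.numAttributes()); // one fewer attribute than before
    }
}
```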
Some algorithms can only handle data in which all attributes are nominal, in which case numeric attributes must be discretized. This data set has three numeric variables: "age", "income", and "children". "children" takes only four values: 0, 1, 2, 3, so we simply edit the ARFF file in UltraEdit and change "@attribute children numeric" to "@attribute children {0,1,2,3}". Open "bank-data.arff" again in "Explorer", select the "children" attribute, and check that the "Type" shown in area 6 has changed to "Nominal".
To discretize "age" and "income" we use the Filter named "Discretize". Click "Choose" in area 2 and a tree of Filters pops up; navigate down to "weka.filters.unsupervised.attribute.Discretize" and click it. If the tree will not close, just click somewhere on the "Explorer" panel outside it. The text box next to "Choose" should now read something like "Discretize -B 10 -M -0.1 -R first-last". Clicking this text box brings up a new window for editing the discretization parameters. We do not want to discretize all attributes, only the first and fourth (see the numbers to the left of the attribute names in area 5), so set "attributeIndices" to "1,4". We plan to split both attributes into 3 segments, so change "bins" to "3". The other boxes need not be changed; click "More" to see what they mean. Click "OK" to return to "Explorer", and you can see that "age" and "income" have been discretized into nominal attributes. To undo the discretization, click "Undo" in area 2. If you are unhappy with an obscure label like "(-inf-34.333333]", you can open the saved ARFF file in UltraEdit and replace all occurrences of "(-inf-34.333333]" with "0_34"; the other labels can be replaced manually in the same way.
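The same discretization can be done in code. Here is a minimal sketch with the weka.filters.unsupervised.attribute.Discretize filter, using the same settings as in the GUI walkthrough (attributes 1 and 4, three bins); the file name is the one prepared above.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeAgeIncome {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data.arff");
        Discretize d = new Discretize();
        d.setAttributeIndices("1,4"); // "age" and "income", counted from 1
        d.setBins(3);                 // three bins, as in the GUI walkthrough
        d.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, d);
        System.out.println(discretized.attribute(0)); // "age" is now nominal
    }
}
```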
The data set obtained by the above operation is saved as bank-data-final.arff.
—- from maya.cs.depaul.edu/~classes/ec…
5. Association rules (market basket analysis)
Note: At present, WEKA’s association rule analysis function is only for demonstration purposes and is not suitable for mining large data sets.
We will mine association rules from the earlier "bank-data" data. Open "bank-data-final.arff" with "Explorer" and switch to the "Associate" tab. The default algorithm for association rule analysis is Apriori, which is exactly the one we want to use; but we need to change the default parameters, so click the text box to the right of "Choose", and click "More" in the pop-up window for a description of each parameter.
First of all, let's review Apriori. For an association rule L -> R, we usually measure its importance by support and confidence. The support of a rule estimates the probability P(L, R) of observing L and R in the same basket, while its confidence estimates the conditional probability P(R | L) that R appears in a basket given that L does. The goal of association rule mining is to produce rules with high support and confidence. Besides confidence, there are several similar measures of how strong a rule is:
① Lift: P(L, R) / (P(L) P(R)). Lift = 1 means L and R are independent; the larger the value, the less likely it is that L and R appearing in the same basket is a coincidence.
② Leverage: P(L, R) - P(L) P(R). Its meaning is similar to Lift: L and R are independent when Leverage = 0, and the greater the Leverage, the closer the relationship between L and R.
③ Conviction: P(L) P(!R) / P(L, !R), where !R means that R does not occur. Conviction also measures the independence of L and R, and we also want it to be as large as possible; its relation to Lift can be seen from the formula (negate R, substitute into the Lift formula, and take the reciprocal). Note that L and R are symmetric in Lift and Leverage, but not in Confidence and Conviction.
We now plan to mine rules whose support is between 10% and 100% and whose lift exceeds 1.5, reporting the 100 rules with the highest lift. So we set "lowerBoundMinSupport" and "upperBoundMinSupport" to 0.1 and 1 respectively, "metricType" to "Lift", "minMetric" to 1.5, and "numRules" to 100, leaving the other options at their defaults. Click "OK", then click "Start" in "Explorer" to run the algorithm; a summary of the data set and the mining results appear in the window on the right.
Here are the top 5 rules by lift that I obtained:
Best rules found:
1. age=52_max save_act=YES current_act=YES 113 ==> income=43759_max 61 conf:(0.54) < lift:(4.05)> lev:(0.08) [45] conv:(1.85)
2. income=43759_max 80 ==> age=52_max save_act=YES current_act=YES 61 conf:(0.76) < lift:(4.05)> lev:(0.08) [45] conv:(3.25)
3. income=43759_max current_act=YES 63 ==> age=52_max save_act=YES 61 conf:(0.97) < lift:(3.85)> lev:(0.08) [45] conv:(15.72)
4. age=52_max save_act=YES 151 ==> income=43759_max current_act=YES 61 conf:(0.4) < lift:(3.85)> lev:(0.08) [45] conv:(1.49)
5. age=52_max save_act=YES 151 ==> income=43759_max 76 conf:(0.5) < lift:(3.77)> lev:(0.09) [55] conv:(1.72)
For each mined rule, WEKA lists all four measures of its strength.
Using the CLI. You can also run the mining task from the command line. Enter a command of the following form in the "Simple CLI" module to run the Apriori algorithm:

java weka.associations.Apriori options -t directory-path\bank-data-final.arff

Note that the file path after the "-t" parameter cannot contain spaces. The options corresponding to our earlier GUI settings are "-N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0"; using them on the command line gives the same result as before. We can also add the "-I" option to list the frequent item sets of each size. The command I used was:

java weka.associations.Apriori -N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -I -t d:\weka\bank-data-final.arff

The mining results are shown in the upper part of the window.
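The API equivalent of this command is only a few lines of Java. A minimal sketch, reusing the same option string as the CLI call above (the file path is an assumption):

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriRun {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data-final.arff");
        Apriori apriori = new Apriori();
        // Same options the Simple CLI command above uses (-T 1 selects the lift metric).
        apriori.setOptions(Utils.splitOptions(
                "-N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0"));
        apriori.buildAssociations(data);
        System.out.println(apriori); // prints the mined rules
    }
}
```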
—- from maya.cs.depaul.edu/~classes/ec…
6. Classification and regression
There is a reason WEKA puts both Classification and Regression under the "Classify" tab: in both tasks there is a target attribute (output variable), and we want to predict it from a set of features (input variables) of a sample (called an instance in WEKA). To do this we need a training data set in which the inputs and outputs of every instance are known; by learning from the training instances, we can build a predictive model and use it to make predictions on new instances whose output is unknown. The measure of a model is how accurate its predictions are. In WEKA, the target (output) attribute to be predicted is called the Class attribute, a name borrowed from the "class" of classification tasks. Generally speaking, if the Class attribute is nominal the task is classification, and if it is numeric the task is regression.
Choosing the algorithm. In this section we use the C4.5 decision tree algorithm to build a classification model for bank-data. Look back at the original "bank-data.csv" file: the "id" attribute is certainly not needed. Since the C4.5 algorithm can handle numeric attributes, we do not need to discretize every variable into nominal types as we did earlier for association rules. Nevertheless, we convert the "children" attribute into the two nominal values "YES" and "NO". In addition, our training set takes only half of the instances of the original data set; some instances are drawn from the other half as the set to be predicted, with their "pep" attribute set to missing values (sketched in code below). The training set produced by these steps can be downloaded here; the prediction set can be downloaded here.
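A minimal sketch of that last step with the WEKA API: it loads the hold-out instances, sets their class ("pep") values to missing, and saves them as the prediction set. The input file name bank-holdout.arff is hypothetical; only the output name bank-new.arff appears in this section.

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class BlankClass {
    public static void main(String[] args) throws Exception {
        Instances unseen = DataSource.read("bank-holdout.arff"); // hypothetical file name
        unseen.setClassIndex(unseen.numAttributes() - 1);        // "pep" is the last attribute
        for (int i = 0; i < unseen.numInstances(); i++) {
            unseen.instance(i).setClassMissing(); // becomes "?" in the saved ARFF
        }
        ArffSaver saver = new ArffSaver();
        saver.setInstances(unseen);
        saver.setFile(new File("bank-new.arff"));
        saver.writeBatch();
    }
}
```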
Open the training set "bank.arff" with "Explorer" and check that it has been processed as described above. Switch to the "Classify" tab and click the "Choose" button; a tree box lists the classification and regression algorithms. WEKA 3.5 has a "Filter..." button below the tree box that filters out unsuitable algorithms based on the characteristics of the data set. The input attributes of our data set are "Binary" (i.e., nominal with only two categories) and "Numeric", and the Class variable is "Binary", so we check "Binary attributes", "Numeric attributes", and "Binary class". Click "OK" and go back to the tree box; some algorithm names have turned red, indicating that they cannot be used. Select "J48" under "trees"; this is the C4.5 algorithm we need, and thankfully it has not turned red. Click the text box to the right of "Choose", and a new window pops up for setting the algorithm's parameters; click "More" for the parameter descriptions and "Capabilities" for what the algorithm can handle. Here we leave the parameters at their defaults. Now look at "Test Options" on the left. We have not set up a separate test data set, so to ensure the quality of the generated model and avoid overfitting, 10-fold cross-validation is needed for model selection and evaluation. If you do not know what cross-validation is, Google it.
Modeling results. Select "Cross-validation" and fill "10" into the "Folds" box, then click the "Start" button to run the algorithm and build the decision tree model. Soon, the decision tree in text form, an error analysis of it, and so on appear in the "Classifier output" on the right. At the same time, an entry appears in the "Results list" at the lower left showing the run time and the algorithm name; if you change the model or the parameters and press "Start" again, the "Results list" gains another entry.
Among the cross-validation results of "J48" we see "Correctly Classified Instances 206 68.6667 %", meaning the accuracy of this model is only about 69%. Perhaps we should transform the original attributes, or tune the algorithm's parameters, to improve the accuracy; but for now let's ignore that and stick with this model.
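If you want to reproduce this cross-validation outside the GUI, here is a minimal sketch with the WEKA Java API, assuming the training file bank.arff from this section and J48's default parameters "-C 0.25 -M 2". The random seed is arbitrary, so the exact accuracy may differ slightly from the GUI run.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class J48CrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");
        data.setClassIndex(data.numAttributes() - 1); // "pep" is the last attribute

        J48 tree = new J48();
        tree.setOptions(Utils.splitOptions("-C 0.25 -M 2")); // Explorer defaults

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // 10-fold CV
        System.out.println(eval.toSummaryString()); // accuracy and other statistics
        System.out.println(eval.toMatrixString());  // the confusion matrix
    }
}
```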
Right-click the new entry in the "Results list" and select "Visualize tree" from the pop-up menu; a new window displays the decision tree graphically. I recommend maximizing this window, then right-clicking and selecting "Fit to screen" to see the tree more clearly. Take a screenshot if you want to keep it, then close the window.
Here we explain what the "Confusion Matrix" means:

=== Confusion Matrix ===
   a   b   <-- classified as
  74  64 |   a = YES
  30 132 |   b = NO

This matrix says that, of the instances whose "pep" is actually "YES", 74 were correctly predicted as "YES" and 64 were wrongly predicted as "NO"; of the instances whose "pep" is actually "NO", 30 were wrongly predicted as "YES" and 132 were correctly predicted as "NO". 74 + 64 + 30 + 132 = 300 is the total number of instances, and (74 + 132) / 300 = 0.68667 is exactly the proportion of correctly classified instances. The larger the numbers on the diagonal of the matrix, the better the predictions.
Applying the model. Now we use the generated model to make predictions on the data set to be predicted. Note that its attributes must be identical to those of the training data set: even if you have no values for the Class attribute, the attribute itself must be present, with its value set to missing on each instance. In "Test Options" select "Supplied test set" and click "Set" to choose the data set to which the model is applied, here the "bank-new.arff" file. Now right-click the model's entry in the "Results list" and select "Re-evaluate model on current test set". The results area on the right gains some content telling you how the model performs on this data set; if all your Class attribute values are missing, these numbers are meaningless, since what we care about is the model's predictions on the new data. Now select "Visualize classifier errors" from the right-click menu; a new window pops up showing a scatter plot of the prediction errors. Click the "Save" button in this window to save an ARFF file. Opening this file, you can see an extra attribute (predictedPEP) in the second-to-last position, whose value on each instance is the model's prediction for it.
Using the command line (recommended). Although the graphical interface is convenient for viewing results and setting parameters, the most direct and flexible way to build and apply models is the command line. Open the "Simple CLI" module and use the "J48" algorithm as above:

java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path\bank.arff -d directory-path\bank.model

The parameters "-C 0.25" and "-M 2" are the same as in the graphical interface. "-t" is followed by the full path of the training data set (directory and file name), and "-d" is followed by the full path under which to save the model. Note that here we can save the model! After entering the command, the resulting tree model and error analysis are displayed in the upper part of "Simple CLI" and can be copied into a text file; the errors reported here come from applying the model to the training set itself. The command for applying this model to "bank-new.arff" has the format:

java weka.classifiers.trees.J48 -p 9 -l directory-path\bank.model -T directory-path\bank-new.arff

"-p 9" says that the true values of the attribute to be predicted sit in the ninth attribute ("pep") of the test set; here they are all unknown and therefore replaced with missing values. "-l" is followed by the full path of the model, and "-T" by the full path of the data set to be predicted. The output looks like:

0 YES 0.75 ?
1 NO 0.7272727272727273 ?
2 YES 0.95 ?
3 YES 0.8813559322033898 ?
4 NO 0.8421052631578947 ?
...

The first column is the instance number ("Instance_number"), the second column is the predicted value ("predictedPEP"), and the fourth column is the original "pep" value in "bank-new.arff" (here a missing value). The third column is the confidence of the prediction: for instance 0 we are 75% sure that its "pep" value is "YES", and for instance 4 we are 84.2% sure that its "pep" value is "NO". We see at least two benefits of using the command line. One is that the model can be saved, so that when new data arrives for prediction there is no need to rebuild the model every time. The other is that a confidence is given for each prediction, so we can adopt the predictions selectively, for example only those with a confidence above 85%.
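Both benefits, saving the model and reading off per-instance confidences, are also available through the API. A minimal sketch mirroring the two commands above (the paths are placeholders):

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictNew {
    public static void main(String[] args) throws Exception {
        // Train once and save, mirroring "-d bank.model".
        Instances train = DataSource.read("bank.arff");
        train.setClassIndex(train.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(train);
        SerializationHelper.write("bank.model", tree);

        // Later: reload and score the new data, mirroring "-l bank.model -T bank-new.arff".
        J48 model = (J48) SerializationHelper.read("bank.model");
        Instances unseen = DataSource.read("bank-new.arff");
        unseen.setClassIndex(unseen.numAttributes() - 1);
        for (int i = 0; i < unseen.numInstances(); i++) {
            double[] dist = model.distributionForInstance(unseen.instance(i));
            int label = (int) model.classifyInstance(unseen.instance(i));
            // Print instance number, predicted class, and its confidence.
            System.out.printf("%d %s %.4f%n", i,
                    unseen.classAttribute().value(label), dist[label]);
        }
    }
}
```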
—- from maya.cs.depaul.edu/~classes/ec…
7. Cluster analysis
Principle and Implementation
The "class" in cluster analysis is different from the "class" in the earlier classification task; a more accurate name for it is "cluster". The task of clustering is to assign all instances to a number of clusters, so that instances in the same cluster gather around its cluster center and are relatively close to each other, while instances in different clusters are far apart. For instances described by numeric attributes, this distance usually means the Euclidean distance. We now run a cluster analysis on the earlier "bank data", using the most common algorithm, k-means. Briefly, k-means first assigns K cluster centers at random, and then: 1) assigns each instance to the cluster center nearest to it, yielding K clusters; 2) computes the mean of all instances in each cluster and takes it as that cluster's new center. Steps 1) and 2) are repeated until the positions of the K cluster centers, and the assignment of instances to clusters, no longer change.
The k-means algorithm above can only handle numeric attributes; when it encounters nominal attributes, they need to be turned into several attributes with values 0 and 1. WEKA performs this nominal-to-numeric transformation automatically, and it also standardizes the numeric data automatically. So the only preprocessing we do on the original "bank-data.csv" is to delete the attribute "id", change the attribute "children" to nominal, and save in ARFF format. The resulting data file is "bank.arff", with 600 instances.
Open the "bank.arff" just produced in "Explorer" and switch to "Cluster". Click the "Choose" button and select "SimpleKMeans", WEKA's implementation of k-means. Click the text box next to it and change "numClusters" to 6, meaning we want to cluster the 600 instances into 6 groups, i.e., K = 6. The "seed" parameter below sets a random seed, from which the random numbers used to pick the initial positions of the K cluster centers are generated; let's set it to 10 for now. Select "Use training set" under "Cluster Mode" and click the "Start" button, then inspect the clustering result given in the "Clusterer output" on the right. You can also right-click the entry in the "Result list" at the lower left and choose "View in separate window" to examine the result in a new window.
First of all, notice this line in the result: "Within cluster sum of squared errors: 1604.7416693522332". This is a criterion for evaluating the quality of the clustering: the smaller the value, the smaller the distances between instances of the same cluster. You may get a different number; in fact, changing the "seed" parameter generally changes this value. We should try several seeds and adopt the result with the smallest value. For example, setting "seed" to 100, I get "Within cluster sum of squared errors: 1555.6241507629218", so I should adopt this latter result. Of course, trying a few more seeds may make the value smaller still; a sketch of automating this search with the API follows below.
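A minimal sketch of that seed search: it tries a handful of seeds, reads the within-cluster sum of squared errors via SimpleKMeans.getSquaredError(), and keeps the best one. The seed range is arbitrary, and bank.arff is the file prepared above.

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansSeedSearch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");
        int bestSeed = -1;
        double bestError = Double.MAX_VALUE;
        for (int seed = 0; seed <= 100; seed += 10) { // arbitrary seed range
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(6);
            km.setSeed(seed);
            km.buildClusterer(data);
            double err = km.getSquaredError(); // "Within cluster sum of squared errors"
            if (err < bestError) { bestError = err; bestSeed = seed; }
        }
        System.out.println("best seed: " + bestSeed + ", SSE: " + bestError);
    }
}
```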
The "Cluster centroids:" section that follows lists the position of each cluster center. For a numeric attribute, the cluster center is its mean (Mean); for a nominal attribute it is its mode (Mode), that is, the value taken by the largest number of instances of that attribute. For numeric attributes, the standard deviation (Std Devs) within each cluster is also given.
The final "Clustered Instances" section gives the number and percentage of instances in each cluster.
To visualize the clustering result, right-click the entry in the "Result list" at the lower left and click "Visualize cluster assignments". The pop-up window shows a scatter plot of the instances; the two boxes at the top select the x and y axes, and the "Colour" box on the second line selects the basis for coloring the points. By default, instances are colored differently according to their cluster ("Cluster"). You can click "Save" here to save the clustering result as an ARFF file. In this new ARFF file, the "Instance_number" attribute gives each instance's number, and the "Cluster" attribute gives the cluster to which the algorithm assigned the instance.
—- from maya.cs.depaul.edu/~classes/ec…
8. Weka learning resources:
① First, Weka's official website: www.cs.waikato.ac.nz/ml/weka/
② The Weka API, available in Chinese: download.csdn.net/detail/mono…
③ Weka programming manual: the Weka Manual can be found on the official website.
④ The Weka Q&A community: weka.wikispaces.com/Frequently+…
⑤ A Weka getting-started guide: weka.wiki.sourceforge.net/Use+Weka+in…
There is also a great deal of material in developer Q&A communities such as Stack Overflow, with many excellent answers. The material above is basically enough for learning Weka. This tutorial mainly explains the use of the Weka API; the official documentation covers the GUI in more detail, and it is easy to explore on your own.
9. Weka secondary development:
For secondary development against the Weka API, it is strongly recommended to manage the project with Maven, so that the required dependency packages can be imported easily. The dependencies needed for Weka development can be found at: mvnrepository.com/artifact/nz… Note that there are still big differences between Weka 3.7 and Weka 3.6: Weka 3.7 removed third-party packages such as LibSVM and SMOTE, but if you need them, you can still find them in the Maven repository. IntelliJ IDEA is a good choice for development, as is Eclipse; once all that is set up, you can start the actual development. A sketch of the dependency declaration is given below.
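With Maven, the Weka dependency can be declared roughly as follows. The coordinates match the truncated mvnrepository link above, but treat the artifact name and version as assumptions and check the repository for what your project needs (for example, weka-dev for the 3.7 line):

```xml
<!-- Assumed artifact/version; verify against the Maven repository link above. -->
<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-stable</artifactId>
    <version>3.6.13</version>
</dependency>
```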
10. Weka source code import
Download the source code

- weka-src.jar is placed in the installation directory after you run the installer downloaded from www.cs.waikato.ac.nz/ml/weka/dow…
- Download via SVN: svn.cms.waikato.ac.nz/svn/weka/tr… (an SVN client such as TortoiseSVN is required)

Import into Eclipse

- Prepare the source code: find weka-src.jar in the Weka installation directory and unzip it into a weka directory; it contains the lib, src, META-INF, and resources folders, plus several other files;
- Create a Java Project named weka in Eclipse, and create a package named weka under src (some people suggest using Maven for project management);
- Import --> File System --> select .../weka/src/main/java/weka;
- Build Path --> Add External JARs --> import the jar packages under lib;
- Other content, such as resources and META-INF, can be imported selectively;
- Run weka.gui.Main; if the GUI starts, the import succeeded.
Reference: blog.csdn.net/hanma602/ar…
Reprinted from:
WEKA usage tutorial: blog.csdn.net/yangliuy/ar…
Basic use of WEKA – introduction to the Explorer interface: blog.csdn.net/u010372981/…