Thursday, February 19, 2009

R Interface to Weka for Statistical Learning

Weka is a Java-based collection of machine learning algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization, and is available at http://www.cs.waikato.ac.nz/~ml/index.html. An example of classification can be seen in the applet at http://www.cs.technion.ac.il/~rani/LocBoost/.

R is an open source application for statistical computing and graphics that I use for modeling commodity prices and have discussed in publications located at TCW. The RWeka package connects the two, providing an R interface to Weka, at http://cran.r-project.org/web/packages/RWeka/index.html . The working paper on this project is at http://epub.wu-wien.ac.at/dyn/virlib/wp/eng/mediate/epub-wu-01_ba6.pdf?ID=epub-wu-01_ba6 . I am doing some work in Weka because it is widely preferred in computer science curricula. A comparison of Weka and R can be found at http://74.125.47.132/search?q=cache:BNwL-HtkC4IJ:wiki.pentaho.com/download/attachments/3801462/ComparingWekaAndR.pdf%3Fversion%3D1+r+and+weka&hl=en&ct=clnk&cd=8&gl=us .

A good introduction to the interface is at http://statmath.wu-wien.ac.at/~zeileis/papers/DSC-2007a.pdf . For example, the authors show how to list the available Weka interfaces from within R:

R> list_Weka_interfaces()
$Associators
[1] "Apriori" "Tertius"
$Classifiers
[1] "AdaBoostM1" "Bagging" "DecisionStump" "IBk"
[5] "J48" "JRip" "LBR" "LMT"
[9] "LinearRegression" "Logistic" "LogitBoost" "M5P"
[13] "M5Rules" "MultiBoostAB" "OneR" "PART"
[17] "SMO" "Stacking"
$Clusterers
[1] "Cobweb" "DBScan" "FarthestFirst" "SimpleKMeans"
[5] "XMeans"
$Filters
[1] "Discretize" "Normalize"
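
To give a feel for how one of the classifiers listed above might be called, here is a minimal sketch rather than anything taken from the paper. It assumes the RWeka package and a working Java runtime are installed. The choice of J48 and of R's built-in iris data set is mine, purely for illustration.

```r
# Assumption: RWeka is installed, e.g. via install.packages("RWeka"),
# and a Java runtime is available for the underlying Weka classes.
library(RWeka)

# Fit Weka's J48 decision tree on R's built-in iris data
# (data set chosen for illustration only).
fit <- J48(Species ~ ., data = iris)

# Predicted classes for the training data.
pred <- predict(fit, newdata = iris)

# Cross-tabulate actual against predicted species.
print(table(actual = iris$Species, predicted = pred))

# Weka's own evaluation summary, here with 10-fold cross-validation.
print(evaluate_Weka_classifier(fit, numFolds = 10, seed = 1))
```

The other classifier names in the list follow the same formula-and-data pattern, so swapping J48 for, say, JRip or IBk should need no other changes in this sketch.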

I think it is worth the effort to combine these two technologies in a middleware web service component to enhance existing AI applications. See our publications for how to interface C#.NET with R.
