Stream Mining Using Statistical Relational Learning
Swarup Chandra, Justin Sahs, Latifur Khan, Bhavani Thuraisingham and Charu Aggarwal*
Department of Computer Science, The University of Texas at Dallas
*IBM T. J. Watson Research Center, NY, USA
12/15/2014
Introduction
Streaming Data Classification
Classification of data instances that occur continuously in a stream, generated by a non-stationary process.
Examples: web search, social media, sensors, communication.
Stream Classification
What are the challenges for classification in a stream?
1 Concept Drift: the data distribution changes over time.
2 Storage Limitation: insufficient space to store all data.
Motivation
Current Approach: chunk-based ensemble models with adaptive learning.1
However, these methods assume data to be independent and identically distributed, which may not always hold.
1 Al-Khateeb, T., Masud, M. M., Khan, L., Aggarwal, C. C., Han, J., & Thuraisingham, B. M. (2012, December). Stream Classification with Recurring and Novel Class Detection Using Class-Based Ensemble. In ICDM (pp. 31-40).
Motivation
Examples of relational streaming data:
Web links between pages (source: Wikipedia).
Relational databases, where entities are connected through shared attributes and foreign keys (slide shows an entity-relationship diagram).
Motivation
What are we looking for?
1 Leverage existing domain knowledge and relationships to perform better classification.
2 Handle uncertainty in class distribution.
3 Overcome the challenges of stream classification.
Overview: express domain knowledge using a language from the field of Statistical Relational Learning called Markov Logic Networks.
Markov Logic Network
A language that combines First-Order Logic (FOL) and Markov Networks.2
MLNs are good for: representing domain knowledge in FOL, relational data, and a compact template structure.
Learning task: learn the weights associated with the FOL formulas from data.
2 Domingos, P., & Lowd, D. (2009). Markov Logic: An Interface Layer for Artificial Intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1), 1-155.
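The probability an MLN assigns to a possible world x is the standard log-linear model from the Domingos & Lowd reference above:

```latex
P(X = x) = \frac{1}{Z} \exp\!\left( \sum_i w_i \, n_i(x) \right)
```

where w_i is the weight of formula i, n_i(x) is the number of true groundings of formula i in world x, and Z is the normalizing partition function. Weight learning adjusts the w_i so that this distribution fits the observed data.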
Stream Mining using MLN
Challenges
1 Finite domain size.
2 Choice of MLN formulas: large, complex formulas increase computational time; small formulas may not sufficiently capture relations.
3 Weight learning may be too slow.
4 Chunk size: a large size increases computational time; a small size may not capture concept drift well.
Stream Mining using MLN
Addressing these challenges
1 Propose a single-model incremental weight learning approach.
2 Propose selective weight learning.
3 Discretize the domain.
4 Limit the number of predicates in each formula.
5 Empirically estimate the best chunk size.
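Point 3 (discretizing the domain) keeps the set of MLN constants finite. A minimal sketch of one way to do this, assuming equal-width binning (the slides do not specify the binning scheme):

```python
def discretize(values, num_bins):
    """Map real-valued attribute values to integer bin ids (equal-width).

    Discretization keeps the MLN domain finite: each real-valued
    attribute becomes a small set of bin-id constants.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width for constant attributes
    bins = []
    for v in values:
        b = int((v - lo) / width)
        bins.append(min(b, num_bins - 1))  # clamp the max value into the last bin
    return bins
```

Each chunk's real-valued attributes would be passed through such a function before grounding the MLN.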
Our Approach
Basic Algorithm
The user supplies domain knowledge as an initial MLN; the data stream arrives in chunks.
Chunk 1: discretize (Chunk1 to dchunk1), then perform weight learning on dchunk1 to turn the initial MLN into MLN1.
Each subsequent chunk i: discretize (Chunki to dchunki); run inference with the current model MLNi-1 on dchunki and record the classification error; then perform weight learning on dchunki to update the model to MLNi.
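The chunk-by-chunk loop above can be sketched as follows; `learn_weights`, `infer`, and `discretize` are placeholders for the Alchemy weight-learning, inference, and discretization steps, not actual Alchemy APIs:

```python
def stream_mine(chunks, initial_mln, learn_weights, infer, discretize):
    """Sketch of the basic chunk-by-chunk algorithm (names are illustrative).

    learn_weights(mln, dchunk) -> MLN with updated formula weights
    infer(mln, dchunk)         -> predictions for the chunk (used for error)
    discretize(chunk)          -> chunk with real-valued attributes binned
    """
    mln = initial_mln
    results = []
    for i, chunk in enumerate(chunks):
        dchunk = discretize(chunk)
        if i > 0:
            # Test-then-train: classify the new chunk with the current model
            # and record the outcome before updating the weights.
            results.append(infer(mln, dchunk))
        # Incremental weight learning on the discretized chunk.
        mln = learn_weights(mln, dchunk)
    return mln, results
```

Note the first chunk only trains the model; every later chunk is classified first and then used for learning.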
Selective Learning
Do we need weight learning at every chunk? Weight learning is expensive, and the change in data distribution may not be significant.
Kullback-Leibler Distance
$P_M(z)$: current-chunk probability distribution of attribute $a = z$.
$P_Q(z)$: overall probability distribution of attribute $a = z$.
$$KL_a = \sum_z P_M(z) \log \frac{P_M(z)}{P_Q(z)}$$
Perform weight learning if $d_a$ is True for some attribute $a$:
$$d_a = \begin{cases} \text{True} & \text{if } \dfrac{KL_a^{prev} - KL_a^{curr}}{KL_a^{prev}} > \text{Threshold} \\ \text{False} & \text{otherwise} \end{cases} \quad \forall a$$
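The KL-based trigger can be sketched in Python; the dictionary representation of the per-attribute distributions is an illustrative assumption:

```python
import math

def kl_divergence(p_m, p_q):
    """KL(P_M || P_Q) over a shared discrete domain; skips zero-mass bins."""
    return sum(p_m[z] * math.log(p_m[z] / p_q[z])
               for z in p_m if p_m[z] > 0 and p_q.get(z, 0) > 0)

def needs_weight_learning(kl_prev, kl_curr, threshold):
    """Trigger weight learning when the relative change in KL distance
    for any attribute a exceeds the threshold (the d_a > Threshold rule)."""
    for a in kl_prev:
        if kl_prev[a] == 0:
            continue
        d_a = (kl_prev[a] - kl_curr[a]) / kl_prev[a]
        if d_a > threshold:
            return True
    return False
```

When the trigger returns False, the current MLN is reused on the next chunk without the expensive weight-learning step.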
Our Approach
Algorithm with Selective Learning
Chunk 1: discretize; compute KL1 = {KLA, KLB}; set OverAllDist = dchunk1; perform weight learning (initial MLN to MLN1).
Chunk 2: discretize; compute KL2 and the distance between KL2 and KL1; here the trigger evaluates to False, so MLN1 is kept without weight learning; OverAllDist accumulates dchunk2.
Chunk 3: discretize; compute KL3 and the distance between KL3 and KL2; here the trigger evaluates to True, so weight learning updates MLN1 to MLN3; OverAllDist accumulates dchunk3.
DataSets
Dataset3         Attribute Count               Classification Problem
                 Total  Discrete  Real-valued
ForestCover        55      45         10       forest cover type
Airline             8       6          2       schedule delay
Poker              11      11          0       poker hand
SyntheticLED        8       8          0       digit displayed
100,000 instances in each dataset.
3 http://moa.cms.waikato.ac.nz/datasets/
MLN Formulas
Domain Knowledge
How did we embed domain knowledge? As relationships between the class and other attributes.
Example: ForestCover dataset, class attribute CoverType. Neota (wilderness area 2) would have spruce/fir cover (type 1):4
WildernessArea(o, 2) => CoverType(o, 1)
4 https://archive.ics.uci.edu/ml/datasets/covertype
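In Alchemy's rule syntax, such a domain-knowledge rule would appear in the .mln file as a weighted implication; the weight shown here is an illustrative placeholder, since the actual weights are learned from the stream:

```
// Neota (wilderness area 2) tends to have spruce/fir cover (type 1)
1.5  WildernessArea(o, 2) => CoverType(o, 1)
```

Alchemy reads the leading number as the formula's weight; formulas written without a weight and ending in a period are treated as hard constraints.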
Experiments
Error Analysis
Weight learning and inference are performed using the Alchemy5 toolkit. Classification error is compared against state-of-the-art stream classifiers from the MOA6 toolkit.
5 http://alchemy.cs.washington.edu/
6 http://moa.cms.waikato.ac.nz/
Results
Error Analysis (classification error, %; chunk size in parentheses)
Classifier                    ForestCover (500)  Airline (500)  Poker (750)  SyntheticLED (750)
MLN                                13.59             35.04          8.86          25.94
HMLN                               26.87             37.4           -             -
SluiceBox                          26.9              45.2          49.9           89.9
Hoeffding Tree                     13.6              40.97         48.27          28.4
NaïveBayes                         22.0              41.07         50.2           27.6
Perceptron                         21.4              46.43         48.27          26.8
SGD                                 6.6              44.13         47.87          87.73
SingleClassifierDrift              12.8              41.33         48.27          29.07
OzaBoost-Adwin                      6.0              41.27         51.6           28.53
Accuracy-Updated-Ensemble27         7.8              35.7          48.13          27.33
7 Brzezinski, D., & Stefanowski, J. (2014). Reacting to different types of concept drift: The accuracy updated ensemble algorithm. IEEE Transactions on Neural Networks and Learning Systems, 25(1), 81-94.
Addressing a Limitation
Major Limitation: weight learning is slow due to multiple iterations.
Selective learning (SL) reduces learning time with little change in error:
Dataset        Without SL             With Selective Learning (SL)
               Time (s)  Error (%)    Threshold (%)  Time (s)  Error (%)
ForestCover     72.28     13.59            5           66.6      13.64
                                          10           61.4      13.81
Airline          8.09     35.04            5            7.94     34.96
                                          10            7.32     35.09
Poker          278.6       8.86            5          189.77      9.06
                                          10          189.95      9.06
SyntheticLED    85.03     25.94            5           60.19     25.99
                                          10           59.9      25.99
Conclusion
Adaptation of Markov Logic Networks for stream mining.
Evaluated an incremental weight-learning approach.
Use of domain knowledge outperforms state-of-the-art approaches.
Thank you
Datasets and MLNs available at http://utdallas.edu/~swarup.chandra/