Database Systems Group Prof. Dr. Thomas Seidl. Topics. Praktikum Big Data Science SS 2017

Overview
- Topics
  1. Subspace Clustering
  2. Search Engine
  3. Graph Learning
  4. Small Data
- Groups

Topic 1: Subspace Clustering
- In KDD1 and KDD2: learned several clustering models and algorithms
  - Density-based, partitioning, and hierarchical clustering
  - Subspace clustering (e.g. SUBCLU, CLIQUE)
  - Projected clustering (e.g. PROCLUS, PREDECON)
  - Correlation clustering (e.g. 4C, CASH)
- In Big Data Management & Analytics:
  - Learned about map-reduce
  - Had a map-reduce variant of k-means
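As a reminder of what a map-reduce variant of k-means can look like, here is a minimal single-machine sketch in plain Python. It is not the implementation from the lecture. The function names (`map_phase`, `reduce_phase`, `kmeans`) are hypothetical, and the map and reduce steps are ordinary function calls rather than distributed jobs.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    # Map: emit (cluster index, (point, 1)) for every input point.
    return [(nearest(p, centroids), (p, 1)) for p in points]


def reduce_phase(pairs, centroids):
    # Reduce: sum coordinates and counts per cluster index, then average.
    sums = {}
    for idx, (point, count) in pairs:
        if idx in sums:
            total, n = sums[idx]
            sums[idx] = ([a + b for a, b in zip(total, point)], n + count)
        else:
            sums[idx] = (list(point), count)
    new_centroids = []
    for i, old in enumerate(centroids):
        if i in sums:
            total, n = sums[i]
            new_centroids.append(tuple(x / n for x in total))
        else:
            new_centroids.append(tuple(old))  # empty cluster: keep old centroid
    return new_centroids


def kmeans(points, centroids, iterations=10):
    # One map-reduce round per k-means iteration.
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids
```

In a real map-reduce job the per-cluster sums would typically be pre-aggregated by a combiner before the reduce step, since only the sums and counts need to cross the network.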

Topic 1: Subspace Clustering
P3C + MR
- A projected/subspace clustering algorithm
- Suitable for large data sets in high-dimensional spaces
- Extends P3C by map-reduce
[Figure labels: S_1, S_2, v_l, v_u, a_1]
Source: Fries, S., Wels, S., & Seidl, T. (2014). Projected Clustering for Huge Data Sets in MapReduce. International Conference on Extending Database Technology, 49-60.

Topic 1: Subspace Clustering
Primary objectives:
- Read and understand the P3C + MR paper and document how the algorithm works
- Identify the major steps/tasks of the algorithm
- Implement the described map-reduce variant
- Evaluate the algorithm
- Create a UI in which the algorithm can be executed on input files (e.g. *.csv) and which returns a visualization

Topic 2: Search Engine
- The Internet holds a huge amount of text (and information)
- How can we retrieve the information we are looking for? => Search engine
- Implement our own search engine using Apache Flink

Topic 2: Search Engine
- Implement a new search engine in a specific context
  - StackOverflow
  - Patent dataset
  - Another dataset?
- Apply standard Information Retrieval algorithms (e.g. BM25 score)

$$\mathrm{BM25}(d_j, q_{1:N}) = \sum_{i=1}^{N} \mathrm{IDF}(q_i) \cdot \frac{\mathrm{TF}(q_i, d_j) \cdot (k + 1)}{\mathrm{TF}(q_i, d_j) + k \cdot \left(1 - b + b \cdot \frac{|d_j|}{L}\right)}$$

- Use Information Extraction to find synonyms and improve the search engine
- Implement Question Answering (e.g. AskMSR)
- If no result satisfies the user, search for a person who can be asked to answer the question
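The BM25 score can be sketched in a few lines of plain Python before porting it to Flink. This is an illustration, not the expected solution. The slide does not define IDF, so the smoothed, non-negative IDF variant used here is an assumption. The defaults `k=1.2` and `b=0.75` are commonly used values, also an assumption. `L` is taken to be the average document length, and documents and queries are assumed to be pre-tokenized lists of terms.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k=1.2, b=0.75):
    """Score every tokenized document in docs against the tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n  # L in the formula
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            if tf[q] == 0:
                continue  # term absent: contributes nothing
            # Assumption: smoothed IDF variant that never goes negative.
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            length_norm = k * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[q] * (k + 1.0) / (tf[q] + length_norm)
        scores.append(score)
    return scores
```

A typical use is to rank documents by descending score, e.g. `sorted(range(len(docs)), key=lambda i: -scores[i])`. In a distributed setting, the document frequencies and the average length are global statistics that have to be computed in a first pass.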

Topic 2: Search Engine
Expected outcome:
- Search algorithm (Okapi BM25) implemented in Flink
- Query website
- Information Retrieval/Extraction in Flink
- Question answering

Topic 3: Graph Learning
- Lots of interesting data has an intrinsic graph structure, e.g.
  - Social networks, sensor networks, citation networks, ...
- Typical graph learning tasks include
  - Node classification, link prediction, content recommendation, ...
- For these learning tasks, it is useful to first learn a latent vector space embedding of the nodes based on the graph structure
- Learned node vectors can further be combined with other node features

Topic 3: Graph Learning
Deepwalk
- Based on the word embedding algorithm word2vec from NLP
- Word representations are learned based on their context (Distributional Hypothesis: words in similar contexts are similar), e.g.
  - "how to stop puppy from barking"
  - "barking dog stole my sleep"
- Adaptation to learn graph node embeddings: sample random walks to form "sentences"
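The "random walks as sentences" idea can be sketched as follows. This covers only walk sampling and the extraction of (center, context) pairs that a skip-gram model would train on. The embedding training itself is omitted. The adjacency-list representation and all function names are assumptions made for illustration, and the walks are uniform, truncated random walks.

```python
import random


def random_walk(graph, start, length, rng):
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbors))
    return walk


def sample_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, length, rng))
    return walks


def skipgram_pairs(walk, window):
    """(center, context) pairs within `window` positions, as in skip-gram."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs
```

Each walk plays the role of a sentence and each node the role of a word, so the resulting walks can be fed to any word2vec-style trainer.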

Topic 3: Graph Learning
Goals
- Get familiar with Flink's graph API Gelly
- Prepare the Deepwalk algorithm and related theory
- Implement the Deepwalk algorithm in Apache Flink
- Improve and optimize your implementation (and try different variations)
- Evaluate your implementations
- (Implement a stream version of the algorithm)
- Think of an interesting use case
  - Apply your node embedding algorithm and solve a subsequent learning task on a real dataset (e.g. embedding of a web graph and recommendation of similar websites)
  - Prepare a demo framework for your use case

Topic 3: Graph Learning
Resources
- Papers
  - Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "Deepwalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
  - Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for networks." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
- Intuition on word2vec: https://deeplearning4j.org/word2vec
- Datasets
  - https://snap.stanford.edu/data/index.html
  - http://konect.uni-koblenz.de/

Topic 4: Small Data
- Why should we consider distributed computation for small data?
  - The dataset fits on one machine
  - A model can be learned in acceptable time on one core
- Finding the best solution to the problem is tricky:
  - Different models (e.g. different classification algorithms)
  - Each model has different hyperparameters (grid search)
  - Cross-validation is often necessary for small data
  - Variance (e.g. due to random parameter initialization)
- Apply map-reduce to find the best model
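The idea of using map-reduce for model selection can be sketched like this: every (hyperparameter, fold) pair becomes one independent map task, and the reduce step averages the fold scores per hyperparameter. This is a toy illustration under stated assumptions. The model is a 1-D k-nearest-neighbour classifier chosen only because it needs no libraries, the names `evaluate` and `select_model` are hypothetical, and the built-in `map` stands in for a distributed map.

```python
from collections import defaultdict
from itertools import product
from statistics import mean


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (1-D features)."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x))[:k]
    labels = [y for _, y in nearest]
    return max(set(labels), key=labels.count)


def evaluate(task, data, n_folds):
    # Map: one independent job per (hyperparameter, fold) pair.
    k, fold = task
    test = data[fold::n_folds]
    train = [p for i, p in enumerate(data) if i % n_folds != fold]
    hits = [1.0 if knn_predict(train, x, k) == y else 0.0 for x, y in test]
    return k, mean(hits)


def select_model(data, ks, n_folds=3):
    tasks = product(ks, range(n_folds))
    # The tasks share no state, so this map could run on many workers.
    results = map(lambda t: evaluate(t, data, n_folds), tasks)
    by_k = defaultdict(list)
    for k, accuracy in results:  # Reduce: group fold scores by hyperparameter
        by_k[k].append(accuracy)
    mean_accuracy = {k: mean(v) for k, v in by_k.items()}
    return max(mean_accuracy, key=mean_accuracy.get), mean_accuracy
```

Repeating each task with several random seeds and averaging in the reduce step would address the variance point in the same way.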

Topic 4: Small Data
- Solve a real-life problem: predict traffic flow in a small road network
  - Given the current travel time, predict the average travel time in one hour
  - Given the current tollgate traffic volume, predict the average traffic volume in one hour
- KDD Cup 2017 (last possible submission: June 1st)

Topic 4: Small Data
Expected outcome:
- Selection of models for the traffic flow prediction problem
- Documentation of models and explanation of hyperparameters
- Model selection framework in Flink
- GUI for the model selection framework for arbitrary datasets
- Best model for the traffic flow prediction problems