IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. ?, NO. ?, 2010

Classification of protein-protein interaction full-text documents using text and citation network features

Artemy Kolchinsky 1,2, Alaa Abi-Haidar 1,2, Jasleen Kaur 1, Ahmed Abdeen Hamed 1 and Luis M. Rocha 1,2

Abstract: We participated (as Team 9) in the Article Classification Task of the BioCreative II.5 Challenge: binary classification of full-text documents relevant for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: (1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier we successfully introduced in BioCreative 2 for binary classification of abstracts, and (2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top performing submission in this task, taking into account the rank product of the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews Correlation Coefficient performance measures. The novel citation network classifier for the biomedical text mining domain, while not a top performing classifier in the challenge, performed above the central tendency of all submissions and therefore indicates a promising new avenue to investigate further in bibliome informatics.

Index Terms: Text Mining, Literature Mining, Binary Classification, Protein-Protein Interaction, Citation Network.

1 BACKGROUND AND DATA

Biomedical research is increasingly dependent on the automatic analysis of databases and literature to determine correlations and interactions amongst biochemical entities, functional roles, phenotypic traits and disease states. The biomedical literature is a large subset of all data available for such inferences.
Indeed, the last decade has witnessed an exponential growth of metabolic, genomic and proteomic documents (articles) being published [1]. PubMed [2] encompasses a growing collection of more than 18 million biomedical articles describing all aspects of our collective knowledge about the bio-chemical and functional roles of genes and proteins in organisms. Biomedical literature mining is a field devoted to integrating the knowledge currently distributed in the literature and a large collection of domain-specific databases [3], [4]. It helps us tap into the biomedical collective knowledge (the bibliome), and uncover new relationships and interactions induced from global information but unreported in individual experiments [5].

The BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge evaluation is an effort to enable comparison of various approaches to literature mining. Its greatest value, perhaps, is that it consists of a community-wide effort, leading many different groups to test their methods against a common set of specific tasks, thus resulting in important benchmarks for future research [6], [7].

Affiliations: 1 School of Informatics, Indiana University, USA; 2 FLAD Computational Biology Collaboratorium, Instituto Gulbenkian de Ciência, Portugal. Corresponding author: rocha@indiana.edu

In most literature or text mining projects in biomedicine, one needs first to collect a set of relevant documents for a given topic of interest such as protein-protein interaction. But manually classifying articles as relevant or irrelevant to a given topic of interest is very time consuming and inefficient for curation of newly published articles [4] and subsequent analysis and integration. The problem of automatic binary classification of documents has been explored in several domains such as Web Mining [8], Spam Filtering [9] and Document Classification in general [10], [11]. The machine learning field has offered many solutions to this problem [12], [11], including methods devoted to the biomedical domain [4].
However, in contrast to performance in well-prepared theoretical scenarios, even the most sophisticated solutions tend to underperform in more realistic situations such as the BioCreative challenge (for example, by over-fitting in the presence of drift between testing and training data).

We participated (as Team 9) in the online and offline parts of the Article Classification Task (ACT) of the BioCreative II.5 Challenge, which consisted of the binary classification of full-text documents as relevant or nonrelevant to the topic of protein-protein interaction (PPI). In most text-mining projects in biomedicine, one needs first to collect a set of relevant documents, typically from information in abstracts. To advance the capability of the community on this essential selection step, binary classification of abstracts was the focus of one of the tasks of the previous BioCreative classification challenge [13]. For this challenge, the objective was instead to classify full-text documents, which allowed us to evaluate the possible additional value of full-text information in this selection problem. The ACT subtask in BioCreative II.5, in particular, aimed to evaluate classification performance between relevant and irrelevant documents to PPI.

Naturally, tools developed for ACT have great potential to be applied in many other literature mining contexts. For that reason, we used two very general classifiers which could easily be applied to other domains and ported to different computer infrastructure: (1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier we successfully introduced in the abstract classification task of BioCreative 2 (BC2) [5], and (2) a Naive Bayes classifier using features extracted from the citation network of the relevant literature. We participated in the online submission with our own annotation server implementing the VTT algorithm via the BioCreative MetaServer platform. The Citation Network Classifier (CNC) runs were submitted via the offline component of the Challenge. We should note that VTT does not require the use of specific databases or ontologies, and so can be ported easily and applied to other domains. In addition, since full-text data contains a wealth of citation information, we developed and tested the novel CNC on its own and integrated with VTT.

We were given 61 PPI-relevant and 558 PPI-irrelevant full-text training documents. We supplemented this data by collecting additional full-text documents appropriately classified in the previous BC2 training data [13] as well as in the MIPS database [14]. For VTT training purposes, we created two datasets: the first contained exactly 4 x 558 = 2232 documents, where the PPI-relevant set is composed of 558 documents from BC2 plus 558 oversampled instances of the 61 relevant documents from this challenge.
The PPI-irrelevant set is composed of 558 documents from BC2 and the 558 irrelevant documents provided with this challenge. The second training set contains 370 PPI-relevant documents extracted from MIPS and 370 randomly sampled irrelevant documents from BC2.

2 VARIABLE TRIGONOMETRIC THRESHOLD CLASSIFICATION

2.1 Word-pair and Entity Features

Since classification had to be performed in real time for the online part of this challenge, we used the lightweight VTT method we previously developed [5] for BioCreative 2. This method, loosely inspired by the spam filtering system SpamHunting [15], is based on computing a linear decision surface (details below) from the probabilities of word-pair features being associated with relevant and irrelevant documents in the training data [5]. A reason for the lightweight nature of VTT is that such word-pair features can be computed from a relatively small number of words. We used only the top 1000 words $W$ obtained from the product of the ranks of the TF.IDF measure [16], averaged over all documents per word $w$, and a score $S(w) = |p_{TP}(w) - p_{TN}(w)|$ that measures the difference between the probabilities of occurrence in relevant ($p_{TP}(w)$) and irrelevant ($p_{TN}(w)$) training set documents (after removal of stop words (footnote 1) and Porter stemming [17]). All incoming full-text documents were converted into ordered lists of these 1000 words, $w \in W$, in the sequence of occurrence in the text. The simplified vector representation and pre-processing of incoming full-text documents makes this method lightweight and appropriate for the online part of this challenge. Words with the highest score $S$ tend to be associated with either positive or negative abstracts and are assumed to be good features for classification.

Fig. 1. The 1000 SP features with largest $S(w_i, w_j)$ on the $p_{TP}/p_{TN}$ plane; size of font proportional to value of $S(w_i, w_j)$.
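The word score described above can be illustrated with a minimal Python sketch. It assumes that $p_{TP}(w)$ and $p_{TN}(w)$ are the fractions of relevant and irrelevant training documents containing $w$, and that documents arrive as lists of already stemmed, stop-word-filtered tokens. The function names and the toy documents are hypothetical and are not taken from the authors' implementation; the TF.IDF ranking step is not shown.

```python
from collections import Counter


def occurrence_probability(docs):
    """Fraction of documents (token lists) in which each word occurs at least once."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # count each word once per document
    return {w: c / len(docs) for w, c in counts.items()}


def s_scores(relevant_docs, irrelevant_docs):
    """S(w) = |p_TP(w) - p_TN(w)| for every word seen in either class."""
    p_tp = occurrence_probability(relevant_docs)
    p_tn = occurrence_probability(irrelevant_docs)
    vocabulary = set(p_tp) | set(p_tn)
    return {w: abs(p_tp.get(w, 0.0) - p_tn.get(w, 0.0)) for w in vocabulary}


def to_ordered_list(tokens, vocabulary):
    """Keep only the selected words, preserving their order of occurrence."""
    return [t for t in tokens if t in vocabulary]


# Toy, made-up documents (assumed to be stemmed and stop-word filtered already).
relevant = [["interact", "protein"], ["interact", "yeast"]]
irrelevant = [["protein", "cell"], ["cell"]]
scores = s_scores(relevant, irrelevant)
ordered = to_ordered_list(["the", "interact", "x", "cell", "interact"], {"interact", "cell"})
```

In this toy example a word that occurs equally often in both classes (here "protein") receives a score of zero, while words concentrated in one class score highest, regardless of which class that is.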
Since in this challenge we were dealing with full-text documents, rather than abstracts as in the previous BC2 challenge, in addition to the $S$ score we also used the TF.IDF rank to select the best word features. Specifically, we used the rank product [18] of TF.IDF with the $S$ score, which resulted in better (k-fold) classification of the training data than using either score alone. The top 15 words were: immunoprecipit, 2gpi, lysat, transfect, interact, domain, plasmid, vector, mutant, fusion, bead, antibodi, pacrg, two-hybrid, yeast.

From word set $W$, we computed short-window (SP) and long-window (LP) word-pair features $(w_i, w_j)$. SP refers to word-pair features composed of adjacent words in the ordered lists that represent documents (footnote 2); the order in which words occur is preserved, therefore $(w_i, w_j) \neq (w_j, w_i)$. LP features refer to word pairs composed of words that occur within 10 words of one another in the ordered lists; in this case, the order in which words occur is not important, therefore $(w_i, w_j) = (w_j, w_i)$. We also computed the probability that such word pairs appear in a positive or negative document: $p_{TP}(w_i, w_j)$ and $p_{TN}(w_i, w_j)$, respectively. Figure 1 depicts the 1000 SP features with largest $S(w_i, w_j) = |p_{TP}(w_i, w_j) - p_{TN}(w_i, w_j)|$ plotted on a plane where the horizontal axis is the value of the probability of occurrence in a relevant document, $p_{TP}(w_i, w_j)$, and the vertical axis is the value of the probability of occurrence in an irrelevant document, $p_{TN}(w_i, w_j)$; we refer to this as the $p_{TP}/p_{TN}$ plane. Table 1 lists the top 10 SP and LP word pairs for score $S(w_i, w_j)$.

TABLE 1
Top 10 SP and LP word-pair features ranked by S score
(columns in the original: $(w_i, w_j)$, $p_{TP}$, $p_{TN}$, $S$; numeric values missing)

SP (w_i, w_j)            LP (w_i, w_j)
interact, between        interact, interact
protein, interact        us, interact
fusion, protein          between, interact
cell, transfect          shown, interact
interact, protein        interact, bind
accept, [missing]        suggest, interact
cell, lysat              protein, immunoprecipit
bind, domain             interact, protein
transfect, cell          assai, interact
yeast, two-hybrid        domain, interact

Footnote 1. The list of stop words removed: i, a, about, an, are, as, at, be, by, for, from, how, in, is, it, of, on, or, that, the, this, to, was, what, when, where, who, will, the, and, we, were. Notice that the words "with" and "between" were kept.

In our previous application of this method in the BC2 challenge [5], we used as an additional feature the number of proteins mentioned in abstracts, as identified by an entity recognition algorithm such as ABNER [19]. However, since in this challenge we were dealing with full-text documents, it was not clear if such relevant entity counts would help the classifier's performance as much as they did when classifying abstracts in BC2, especially since ABNER itself is trained only on abstracts.
Therefore, we focused on counting entity occurrences in specific portions of documents such as the abstract, the body, figure captions, table captions, as well as combinations of these. In addition to protein mentions recognized by ABNER, we tested many other entities identified by ABNER and an ontology-based annotator (which matched terms in text to PPI terms extracted from the Gene Ontology, the Protein-Protein Interaction Ontology, the Protein Ontology, and the Disease Ontology). Since the additional ABNER and ontology-based features did not lead to the identification of entity features that seemed to distinguish PPI-relevant from irrelevant documents (as discussed below), we do not describe the process of extracting such features here.

Footnote 2. Notice that the ordered lists representing documents contain only words in set $W$.

Fig. 2. Comparison of the counts of protein mentions as identified by ABNER in distinct passages of documents in the training data. Top figure depicts the counts of ABNER protein mentions in the body section, whereas bottom figure depicts the counts of ABNER protein mentions in figure captions and abstracts. In these figures, the horizontal axis represents the number of mentions $x$, and the vertical axis the probability $p(x)$ of documents with at least $x$ mentions. The blue circles denote documents labeled relevant, while the red squares denote documents labeled irrelevant; the green triangles denote the difference between blue and red lines.

The only entity feature that proved useful in discriminating relevant and irrelevant documents in the training data was the count of protein mentions in abstracts and figure captions as recognized by ABNER. Figure 2 depicts a comparison of the counts of ABNER protein mentions in two specific portions of all documents of the BioCreative II.5 training data: the body, and the abstract plus figure captions. As can be seen, the counts of protein mentions in the body of the full-text documents in the training data do not discriminate between relevant and irrelevant documents. In contrast, the same counts restricted to abstracts and figure caption passages are different for relevant and irrelevant documents. We used this type of plot to identify which features and which document portions behave differently for relevant and irrelevant documents; only the counts of ABNER protein mentions in abstracts and figure captions were sufficiently distinct between the two classes. Based on observations of plots such as those depicted in Figure 2, we decided not to test these additional features on training data.

It is not possible for us to identify exactly why the entity count features we tested failed to discriminate between documents labeled relevant and irrelevant in the training data. Because we had no access to annotations of protein mentions on the full-text corpus, we cannot compute the failure rates of the entity recognition tools we used (i.e. ABNER).

2.2 Methods

The ideal word-pair features in the $p_{TP}/p_{TN}$ plane are those closest to either one of the axes.
Any feature $w$ is a vector on this plane (see figure 3), therefore feature relevance to each of the classes can be measured with the traditional trigonometric measures of the angle ($\alpha$) between this vector and the $p_{TP}$ axis: $\cos(\alpha)$ is a measure of how strongly features are associated with positive/relevant documents, and $\sin(\alpha)$ with negative/irrelevant ones in the training data. Then, for every document $d$, we compute the sum of all feature contributions for a positive ($P$) and negative ($N$) decision:

$$P(d) = \sum_{w \in d} \cos(\alpha(w)) = \sum_{w \in d} \frac{p_{TP}(w)}{\sqrt{p_{TP}^2(w) + p_{TN}^2(w)}}, \qquad N(d) = \sum_{w \in d} \sin(\alpha(w)) = \sum_{w \in d} \frac{p_{TN}(w)}{\sqrt{p_{TP}^2(w) + p_{TN}^2(w)}} \qquad (1)$$

The decision of whether document $d$ is a member of the PPI-relevant (TP) or irrelevant (TN) set of documents is then computed as:

$$d \in \begin{cases} TP, & \text{if } \dfrac{P(d)}{N(d)} \geq \lambda_0 + \dfrac{\beta - \sum_k n_k(d)}{\beta} \\ TN, & \text{otherwise} \end{cases} \qquad (2)$$

where $\lambda_0$ is a constant threshold for deciding whether a document is positive/relevant or negative/irrelevant. This threshold is subsequently adjusted for each document with the factor $(\beta - \sum_k n_k(d))/\beta$, where $\beta$ is another constant, and $\sum_k n_k(d)$ is a series of counts of topic-relevant entities in document $d$. As discussed above, the only entity that proved useful in discriminating between relevant and irrelevant documents in the training data of the BC II.5 challenge was the count of protein mentions in the abstracts and figure captions of documents as recognized by ABNER. Therefore, in this case $\sum_k n_k(d)$ becomes simply $np(d)$, which is the number of protein mentions in the abstract and figure captions of $d$.

Fig. 3. Trigonometric measures of term relevance in the $p_{TP}/p_{TN}$ plane; $p_{TP}$ and $p_{TN}$ computed from labeled documents in training data.

In formula 2, the classification threshold linearly decreases as $\sum_k n_k(d)$ increases. The assumption is that the more relevant entities are recognized in a document, the higher the chances that the document is relevant.
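As an illustration of the feature sums and decision rule just described, the following Python sketch extracts SP and LP word pairs from an ordered word list, accumulates the cosine and sine contributions, and compares the ratio P/N with the entity-adjusted threshold. All function names and toy probabilities are hypothetical. The sketch assumes that every feature occurrence in the document contributes to the sums, that features unseen in training are skipped, and that the comparison is non-strict; none of these choices is confirmed by the text.

```python
import math


def sp_pairs(words):
    """Short-window pairs: adjacent words, order preserved."""
    return list(zip(words, words[1:]))


def lp_pairs(words, window=10):
    """Long-window pairs: words at most `window` positions apart, order ignored."""
    pairs = []
    for i, w in enumerate(words):
        for v in words[i + 1 : i + 1 + window]:
            pairs.append(tuple(sorted((w, v))))  # canonical order, so (a, b) == (b, a)
    return pairs


def vtt_sums(features, p_tp, p_tn):
    """Return (P, N): sums of cos(alpha) and sin(alpha) over a document's features."""
    P = N = 0.0
    for f in features:
        tp, tn = p_tp.get(f, 0.0), p_tn.get(f, 0.0)
        norm = math.hypot(tp, tn)
        if norm == 0.0:
            continue  # assumption: features never seen in training are ignored
        P += tp / norm
        N += tn / norm
    return P, N


def vtt_threshold(lambda0, beta, entity_count):
    """Threshold of formula 2: lambda0 + (beta - sum_k n_k(d)) / beta."""
    return lambda0 + (beta - entity_count) / beta


def vtt_is_relevant(P, N, lambda0, beta, entity_count):
    """True when P/N reaches the threshold."""
    if N == 0.0:
        return P > 0.0  # assumption: the ratio is treated as unbounded when N is zero
    return P / N >= vtt_threshold(lambda0, beta, entity_count)


# Toy, made-up training probabilities for two SP features.
words = ["protein", "interact", "yeast"]
p_tp = {("protein", "interact"): 0.3, ("interact", "yeast"): 0.0}
p_tn = {("protein", "interact"): 0.4, ("interact", "yeast"): 0.2}
P, N = vtt_sums(sp_pairs(words), p_tp, p_tn)
```

Sorting each LP pair is one simple way to make the pair order-independent; the SP pairs are left as ordered tuples so that (w_i, w_j) and (w_j, w_i) remain distinct features.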
In this case, this means that the higher the number of ABNER-recognized protein mentions, the easier it is to classify a document as PPI-relevant; conversely, the lower the number of protein mentions, the easier it is to classify a document as PPI-irrelevant. When $\sum_k n_k(d) = \beta$, the threshold is simply $\lambda_0$. We refer to this classification method as Variable Trigonometric Threshold (VTT). Examples of the decision surface for training data are depicted in Figures 4 and 5, and are explained below.

A measure of confidence in the classification decision for ranking documents is naturally derived from formula 2: confidence should be proportional to the value

$$\delta(d) = \left| \frac{P(d)}{N(d)} - T(d) \right|$$

where $T(d) = \lambda_0 + \frac{\beta - \sum_k n_k(d)}{\beta}$ is the threshold point for document $d$. Thus, the further away from the decision surface a document is, the higher the confidence in the decision. Therefore, $\delta(d)$ is a measure of distance from a document's ratio of feature weights ($P(d)/N(d)$) to the decision surface or threshold point for that document, $T(d)$. Since BC II.5 required a confidence value in [0, 1], we used the following measure of confidence $C$ of the decision made for a document $d$:

$$C(d) = \frac{\delta(d)}{\max \delta(d)} \qquad (3)$$

where $\max \delta(d)$ is the maximum value of the distance $\delta$ found in the training data. If a test document $d_t$ results in a $\delta(d_t)$ that is larger than $\max \delta(d)$, $C(d_t) = 1$. In BC II.5, we ranked positive documents by decreasing value of $C$, followed by negative documents ranked by increasing value of $C$.

2.3 Training

Training of the VTT classifier consisted of exhaustively searching the parameters $\lambda_0$ and $\beta$ that define its linear surface, while doing k-fold cross-validation (k = 8) on both of the training data sets described in section 1: the first with documents from the BC2 and BC II.5 challenges, and the second with additional MIPS data. We swept the following parameter range: $\lambda_0 \in [0, 10]$ and $\beta \in [1, 100]$, in steps of $\Delta\lambda$ = [missing] and $\Delta\beta = 1$. For each $(\lambda_0, \beta)$ pair, we computed the mean of the Balanced F-Score (F1) and Accuracy measures for the 8 folds of each training data set (footnote 3). Given the two training data sets and two performance measures, we chose VTT parameter-sets to be those that minimized the product of ranks obtained from computing each performance measure on a specific training data set. More specifically, we computed four ranks for each classifier tested in the parameter search stage: $r_F^{T1}$ and $r_A^{T1}$ rank classifiers according to the mean value of F-Score and Accuracy in the 8 folds of the first training data set, respectively; $r_F^{T2}$ and $r_A^{T2}$ rank classifiers according to the mean value of F-Score and Accuracy in the 8 folds of the second training data set, respectively. We then ranked all classifiers tested according to the rank product of these four ranks: $R = (r_F^{T1} \, r_A^{T1} \, r_F^{T2} \, r_A^{T2})^{1/4}$ [18]. This procedure was performed for the two distinct word-pair feature sets: SP and LP. Our training strategy was based on a balanced scenario with equal numbers of positive (PPI-relevant) and negative documents.

Footnote 3. $\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$ and $F_1 = \frac{2\,TP}{2\,TP + FP + FN}$, where TP, TN, FP, and FN refer to true positives, true negatives, false positives, and false negatives, respectively.

We then submitted 5 runs to the online challenge:

1) Best parameter-set for SP features, which was the top performer in the first training data set (data from BC2 and BC II.5) when using SP features.
2) Best parameter-set for LP features, which was the top performer in both training data sets when using LP features.
3) Second-best parameter set for SP features, which was the top performer in the second training data set (data from MIPS) when using SP features.
4) Best parameter set for SP features without the variable threshold computed from ABNER's entity recognition ($np(d) = \beta$), and trained only on the first training data set (no MIPS data).
5) Best parameter set for LP features without the variable threshold computed from ABNER's entity recognition ($np(d) = \beta$), and trained only on the first training data set (no MIPS data).

The VTT parameter-sets for these five runs are summarized in table 2. Figures 4 and 5 depict the VTT decision surfaces with some of the submitted parameters for the two training data sets and word-pair features.

TABLE 2
VTT parameters for online runs.
(the β and λ0 columns of the original table are missing)

Run      Feature Set   Entity Feature   MIPS data
Run 1    SP            Y                Y
Run 2    LP            Y                Y
Run 3    SP            Y                Y
Run 4    SP            N                N
Run 5    LP            N                N

Fig. 4. VTT decision surface for $\lambda_0$ = [missing] and $\beta = 36$ for the documents in 4 of the 8 folds of the first training data set, using SP feature set (parameters used in Run 3). Horizontal axis corresponds to the value of $P(d)/N(d)$ and vertical axis corresponds to the value of $np(d)$, for each document $d$. Black (documents from BC II.5 challenge) and gray (documents from BC2 challenge) circles represent positive documents, whereas red (documents from BC II.5 challenge) and orange (documents from BC2 challenge) circles represent negative documents.

2.4 Results

During the online part of the challenge, two minor technical issues arose.
The first was an inconsistency in the Unicode decoding of online-submitted documents that caused some features not to be extracted correctly. The second was a caching problem that caused miscalculation of ABNER counts (entity feature, see 2.1) for many documents. Despite these errors, all of the submitted runs performed very well. The official scores of the five runs against the online test set are provided in table 3. After the challenge, we corrected the Unicode and ABNER cache errors and computed new performance measures for the same five classifier parameters (see table 2) (footnote 4). The corrected scores are shown in table 4. Notice that the re-submitted runs did not entail retraining the classifiers using information from the test data available after the challenge. Indeed, we used the same VTT parameters in the original and re-submitted runs (table 2), as obtained by the reproducible training algorithm described in 2.3. We present the corrected results to demonstrate the merits of the method computed without errors, especially because it is important to determine the benefits of using entity recognition via ABNER, the algorithm component which was most directly affected by the errors.

Footnote 4. We used the gold standard and evaluation script provided by the competition organizers after the BC II.5 challenge; we added the calculation of Precision, Recall and Balanced F-Score.

TABLE 3
Official VTT scores for online runs.
(columns: Run 1, Run 2, Run 3, Run 4, Run 5; rows: TP, FP, FN, TN, Specificity, Sens./Recall, Precision, F1, Accuracy, MCC, P at Full R, AUC iP/R; numeric values missing)

TABLE 4
VTT scores after Unicode and ABNER cache correction.
(columns: Run 1, Run 2, Run 3, Run 4, Run 5; rows: TP, FP, FN, TN, Specificity, Sens./Recall, Precision, F1, Accuracy, MCC, P at Full R, AUC iP/R; numeric values missing)

Fig. 5. VTT decision surface for $\lambda_0$ = [missing] and $\beta = 72$ for the documents in 1 of the 8 folds of the second training data set, using LP feature set (parameters used in Run 2). Horizontal axis corresponds to the value of $P(d)/N(d)$ and vertical axis corresponds to the value of $np(d)$, for each document $d$. Black circles represent positive documents (from MIPS), whereas red circles represent negative documents (from the BC II.5 Challenge).
Because there are various ways to measure misclassification (type I and II errors) given the confusion matrix of the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN), there is no perfect way to characterize the performance of binary classifiers [20]. Therefore, it is important to compute performance using various measures [21]. One reasonable way to obtain an overall ranking of performance of a binary classifier $c$ is to combine a few standard measures via the rank product [18]:

$$RP(c) = \sqrt[k]{\prod_{m=1}^{k} r_{c,m}} \qquad (4)$$

where $k$ is the number of measures considered and $r_{c,m}$ is the rank of the performance of classifier $c$ according to measure $m$. The best classifiers are then those that minimize overall RP. To provide a well-rounded assessment of performance using the rank product, well-established performance measures with distinct characteristics are needed. The BioCreative II.5 challenge evaluation relies on various measures of performance; we center our discussion on four of them: Area Under the interpolated precision and recall Curve (AUC), Accuracy, Balanced F-Score ($F_1$), and Matthews Correlation Coefficient (MCC).

AUC [22], [23] was the preferred measure of performance for this challenge as it is robust and ideal for evaluating the quality of ranked results for all recall percentages. Nonetheless, it does not account directly for misclassification errors; for instance, the runs submitted by team 13 (footnote 5) labeled every document as positive, yet had the 6th best AUC in the challenge ($r_{13,AUC} = 6$), after runs from team 20 (footnote 6) and our own team 9. Accuracy is the proportion of true results, which is a standard measure for assessing the performance of binary classification [20], [21]. $F_1$ is also a standard measure of classification effectiveness [20]; it is a balanced measure of the proportion of correct results from the returned results (precision) and from those that should have been returned (recall). Because $F_1$, unlike Accuracy, does not depend on the number of true negatives, it is important to take into account both measures, especially in the unbalanced scenario of this challenge, where the abundance of negative (irrelevant) articles leads to high values of the Accuracy measure for classifiers biased for negative classifications [21]. The MCC measure (footnote 7) [24] is a well-regarded, balanced measure for binary classification, very well suited for unbalanced class scenarios such as this challenge [21]. These four measures assess distinct aspects of binary classification, thus yielding a well-rounded view of performance when combined via the rank product of formula 4.

Footnote 5. Hongfang Liu's team at Georgetown University.
Footnote 6. Kyle Ambert and Aaron Cohen at Oregon Health & Science University.

There is no need to include other performance measures such as sensitivity and specificity in the set of measures in our performance rank product: sensitivity is the same as recall (footnote 8), already taken into account by the F-Score, and specificity (or True Negative Rate) is of little utility when classes are unbalanced with many more negative (irrelevant) documents, as in this challenge. Moreover, including these two measures in our rank product does not change the rank of the top two performing runs for the entire challenge (for original or re-submitted runs).

TABLE 5
Central tendency and variation of performance measures for all submissions to the ACT of the BC II.5 Challenge
(columns: Accuracy, MCC, P at Full R, AUC, F1; rows: Mean, St. Dev., Median, % Conf., % Conf.; numeric values missing)

All five of our submitted runs were well above the central tendency of the runs submitted by all teams (in the collection of online and offline submissions). Indeed, the performance of all of our submitted runs is above the 95% confidence interval of the mean of all submitted runs.
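For readers who want to reproduce this kind of ranking, the confusion-matrix measures and the rank product of formula 4 can be sketched in Python as follows. This is an illustrative sketch, not the challenge's evaluation script: the function names, the handling of ties (classifiers with equal scores share a rank), and the toy scores are all assumptions, and AUC is taken as a precomputed input.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Proportion of correct decisions."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Balanced F-Score; does not depend on the number of true negatives."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; defined as 0 here when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def ranks(scores):
    """Rank classifiers by one measure: 1 = highest score, ties share a rank."""
    return {c: 1 + sum(other > s for other in scores.values()) for c, s in scores.items()}


def rank_product(score_table):
    """Geometric mean of each classifier's ranks over all measures (formula 4).

    score_table maps measure name -> {classifier name -> score}.
    """
    per_measure = [ranks(scores) for scores in score_table.values()]
    k = len(per_measure)
    return {c: math.prod(r[c] for r in per_measure) ** (1 / k) for c in per_measure[0]}


# Toy, made-up scores for two hypothetical classifiers "a" and "b".
table = {
    "AUC": {"a": 0.9, "b": 0.8},
    "MCC": {"a": 0.6, "b": 0.5},
}
rp = rank_product(table)  # the classifier with the lowest value is the best overall
```

Because the rank product is a geometric mean of ranks, a classifier that is best on one measure but poor on another is penalized relative to one that ranks consistently well on all measures.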
Table 5 depicts the central tendency and variation of the performance measures for the runs submitted to the challenge by all participating groups. Table 6 shows the overall top five original runs submitted to the ACT of the BC II.5 Challenge, ranked in increasing value of the rank product of formula 4; Table 7 shows the overall top five runs after correction of the Unicode and ABNER cache errors. According to the rank product of the four measures discussed above, our corrected, post-challenge run 1 is the top classifier, followed by the best run from team 20 and our other 4 runs (5, 4, 2, 3, respectively). If we do not consider our re-submitted runs, then the best run from team 20 is the top performer, followed by our submitted official runs 5, 4, and 1, followed by three runs from Team [missing] (footnote 9). Therefore, even without considering our re-submitted runs, the VTT classifier was one of the top two performers overall.

Footnote 7. $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
Footnote 8. $\frac{TP}{TP + FN}$

TABLE 6
Rank Product performance of top 5 original submissions to the ACT of the BC II.5 Challenge. Also shown are individual ranks for the four constituent performance measures.
(columns: Runs, RP, AUC, F1, Accuracy, MCC; the five rows are one Team entry, three Team 9 entries, and one further Team entry; team numbers, run numbers and numeric values missing)

TABLE 7
Rank Product performance of top 5 submissions to the ACT of the BC II.5 Challenge, after Unicode and ABNER cache correction. Also shown are individual ranks for the four constituent performance measures.
(columns: Runs, RP, AUC, F1, Accuracy, MCC; the five rows are one Team 9 entry, one Team entry, and three Team 9 entries; team numbers, run numbers and numeric values missing)

Looking at the four measures of performance individually, of the original submissions VTT run 5 was the top performer for MCC and $F_1$, while VTT run 4 was the top performer for Accuracy and second-best for AUC after team 20. When we consider the re-submitted runs, VTT run 1 was the top performer for Accuracy, MCC, and $F_1$, while VTT run 4 achieved the best AUC score, which was the preferred performance measure in the challenge.
However, when we consider the other performance measures, this classifier was not our best performer. Using the rank product measure, we conclude that the parameter-set used for Run 1, once properly computed in Run 1', led to the most well-rounded classifier and the top performer for Accuracy, MCC, and $F_1$, while at the same time obtaining quite a good AUC score. The presence of the entity (ABNER counts) feature differentiates Runs 1 and 4. We observe that using this feature led to the most well-rounded submission (Run 1), but not using it led to the best AUC measurement (Run 4). We also observe that the use of additional MIPS data for training purposes did not lead to any improvement in this challenge, as the parameter-sets for Runs 1 and 4 were also the best found for the first data set alone. Moreover, Run 3 (and 3'), which used the best parameter-set for training on MIPS data, was our poorest performer. Finally, we do not observe a distinct benefit of using one or the other type of word-pair features: while the SP feature set was used in our best run (1), the LP feature set was used in our second-best run (5). Figures 6 and 7 depict in graphical form the performance of our submissions for the 4 performance measures above, in comparison with the other top performer (the classifier from Team 20) in the ACT component of this challenge.

Footnote 9. The team of Yong-gang Cao, of the University of Wisconsin-Milwaukee.

Fig. 6. Accuracy and AUC performance of VTT runs in comparison with the other top performing submission (group 20). The portion of the plane shown is well above the 95% confidence interval of the mean for all submissions to the ACT of the BC II.5 challenge. Blue diamonds represent the official VTT online submissions, and the red squares represent the same runs after fixing the Unicode and ABNER cache errors. The green triangle represents the other top performer in this challenge.

Fig. 7. MCC and $F_1$ performance of VTT runs in comparison with the other top performing submission (group 20). The portion of the plane shown is well above the 95% confidence interval of the mean for all submissions to the ACT of the BC II.5 challenge. Blue diamonds represent the official VTT online submissions, and the red squares represent the same runs after fixing the Unicode and ABNER cache errors. The green triangle represents the other top performer in this challenge.

3 CITATION NETWORK CLASSIFIER

3.1 Method

We also developed the Citation Network Classifier (CNC) to identify PPI-relevant articles using features extracted from citations and additional information derived from the citation network of the bibliome. We did not employ this classifier in the online part of the challenge because citation information was only available in the offline, XML-version of the test set. Its lightweight performance, however, makes it suitable for real-time classification.
We implemented this method using a Naive Bayes classifier on the following equally-weighted citation features: (1) cited PubMed IDs (PMIDs), (2) citation authors, and (3) citation author/year pairs. We calculated $p(\text{class} = \text{PPI} \mid \text{Feature} = f)$ and $p(\text{class} = \text{non-PPI} \mid \text{Feature} = f)$ for the features found in the documents in the training set, smoothed the distributions using Laplace's rule (smoothing parameter of 0.01), and selected the top features using their Chi-square rank (the number of top features kept in runs 1, 2, 4 and 5 differed from the number kept in run 3). Additionally, during scoring we treated each document's own authors as if they were cited by that article three times; this allowed authorship information to be included and to play a role in improving classifier performance. During classification, each document was assigned to the class with the maximum a posteriori (MAP) probability given that document's features. An uninformative equiprobable class prior was used. Additionally, $I(\mathit{Class}; \mathit{Features})$, the mutual information between a document's class and citation features, was used as a classification confidence score. It was calculated as the decrease in uncertainty (entropy) between the prior and posterior class distributions:

$$I(\mathit{Class}; \mathit{Features}) = H(\mathit{Class}) - H(\mathit{Class} \mid \mathit{Features})$$

Because the uncertainty present in the prior class distribution of a binary classifier is at most 1 bit, and because entropy is always non-negative and does not increase under conditioning [25], this quantity naturally falls in the unit interval.

One significant issue encountered during the implementation of this classifier was the lack of an easily accessible database of biological citations, or a comprehensive repository of parsable biological articles from which one could easily be built. We created our own citation database using a combination of scraping and parsing scripts.
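The scoring scheme described above (Laplace-smoothed Naive Bayes, MAP decision under an equiprobable prior, and an entropy-based confidence) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature strings, and the use of per-class likelihoods P(feature | class) in the standard multinomial form are assumptions made for illustration, and the Chi-square feature selection step is omitted.

```python
import math
from collections import Counter

ALPHA = 0.01                      # Laplace smoothing parameter quoted in the text
CLASSES = ("PPI", "non-PPI")


def train(labelled_docs):
    """labelled_docs: iterable of (citation_features, label) pairs.

    Returns {class: {feature: log P(feature | class)}} with Laplace smoothing.
    (Hypothetical helper; the multinomial form is an assumption.)
    """
    counts = {c: Counter() for c in CLASSES}
    for features, label in labelled_docs:
        counts[label].update(features)
    vocab = set().union(*counts.values())
    model = {}
    for c in CLASSES:
        total = sum(counts[c].values()) + ALPHA * len(vocab)
        model[c] = {f: math.log((counts[c][f] + ALPHA) / total) for f in vocab}
    return model


def entropy_bits(p):
    """Entropy of the binary distribution (p, 1 - p), in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def classify(model, cited_features, own_authors=()):
    """MAP decision under an equiprobable prior, plus a confidence in [0, 1]."""
    # The document's own authors are counted as if they were cited three times.
    features = list(cited_features) + 3 * list(own_authors)
    # Features never seen in training are ignored.
    log_odds = sum(
        model["PPI"][f] - model["non-PPI"][f] for f in features if f in model["PPI"]
    )
    # Equal priors cancel, so the posterior is a logistic function of the log-odds.
    if log_odds >= 0:
        p_ppi = 1.0 / (1.0 + math.exp(-log_odds))
    else:
        p_ppi = math.exp(log_odds) / (1.0 + math.exp(log_odds))
    label = "PPI" if p_ppi >= 0.5 else "non-PPI"
    # The prior entropy is 1 bit, so confidence = H(Class) - H(Class | Features).
    return label, 1.0 - entropy_bits(p_ppi)


# Toy training data; the feature strings are invented for illustration.
toy_docs = [
    (["author:Fields S", "pmid:1"], "PPI"),
    (["author:Fields S"], "PPI"),
    (["author:Gygi SP", "pmid:2"], "non-PPI"),
    (["author:Gygi SP"], "non-PPI"),
]
toy_model = train(toy_docs)
print(classify(toy_model, ["author:Fields S"]))
```

A document with no known citation features keeps the prior (posterior 0.5), so its confidence is 0, which matches the reading of the confidence score as the number of bits of uncertainty removed by the features.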
Starting from a list of PMIDs from the training data for which citation data was needed, we queried PubMed for publication information and then attempted to locate and download articles in PDF format from journal websites. When a PDF version of an article was retrieved, its raw textual content was first obtained using the pdf2text converter, then the ParsCit parser [26] was used to extract XML-formatted bibliographic references. Successfully parsed reference data was converted into PMIDs using the PubMed search API, which resulted in a list of cited PMIDs for each initial PMID. Our scripts were initially run on articles cited by documents in the BC II.5 training set; further iterations then looked for articles cited by those articles, and so on recursively. Using this method, we acquired a collection of PDF files, from which PMIDs, reference PMIDs, and citations were extracted.

The set of cited articles and authors to be found in test data is potentially enormous. Moreover, the training data provides class information ($P(\mathit{Class} \mid \mathit{Feature})$ distributions) for only a small number of citation features. Using co-citations allowed this class information to diffuse over the links of the harvested citation network. For this purpose, we used a co-citation measure from feature A to feature B:

$$\omega(A, B) = \frac{\#\,\text{times feature } A \text{ is co-cited with feature } B}{\#\,\text{times feature } A \text{ is co-cited in total}}$$

When a citation feature without class information was found in a test article, its class distribution was approximated as a linear combination over its neighbors, using the weights of the edges to those neighbors in a co-citation network defined by the $\omega(A, B)$ measure described above. This network was built using the three types of citation features: PMIDs, authors, and author/year pairs. Feature co-citations that occurred only once were eliminated in all our runs. It should be noted that the co-citation network is a directed weighted graph, since the co-citation measure above is not symmetric. An asymmetry would result if one article or author was usually cited in combination with another, but the latter was also cited in many cases where the former was not. In this situation, the former would have a stronger ω weight to the latter than vice versa.

Finally, we also integrated the CNC with the VTT classifier, configured with the parameter values used in our online submission 4. This was done in the following manner: if the distance of a document to the decision surface of VTT, as quantified by the δ(d) measure explained in Section 2.2, was above a certain constant, the VTT result was used; otherwise class membership was decided by the classifier with the largest confidence (VTT or CNC). In that case, the combined confidence was the sum (difference) of the confidence values of the two classifiers when they agreed (disagreed) in their class label assignment, divided by a normalizing constant.

3.2 Results

The CNC was trained on the combination of the BioCreative II.5 training set (595 documents; see footnote 10) and the BioCreative 2 training set (5495 documents). The 10 most informative features found by CNC are listed in Table 8. The PubMed IDs in this table refer to two highly-cited protein-related but not PPI-related articles ([27], [28]), which were found frequently in the negative training data. Among the other authors listed, Elledge SJ, Fields S, and Cooper JA have all published important works in the PPI domain, while the remaining authors have published extensively in proteomics-related (but again, not PPI-related) literature.

10. While the initial training set released for BC II.5 contained 609 articles, a subsequent version of the training set contained only 595 articles. We used the first set in the training of VTT for the online challenge, but the more recent one in the training of CNC for the offline challenge.

TABLE 8
Highest scoring features found by the CNC algorithm.

Citation        P(F | PPI)    P(F | non-PPI)
PMID:           E             E-04
Elledge SJ      2.80E         E-05
Gygi SP         2.19E         E-04
Fields S        2.99E         E-05
Gorg A          1.83E         E-04
Sanchez JC      9.12E         E-04
PMID:           E             E-04
Creasy DM       4.57E         E-04
Cooper JA       1.99E         E-05
Aebersold R     5.02E         E-04

TABLE 9
CNC parameters for offline runs.

                    Run 1   Run 2   Run 3   Run 4   Run 5
# Features
Co-citation data    N       Y       Y       N       Y
Mix with VTT        N       N       N       Y       Y

TABLE 10
Official CNC scores for offline runs. Columns: Run 1 to Run 5. Rows: TP, FP, FN, TN, Specificity, Sensitivity/Recall, Precision, F1, Accuracy, MCC, P at Full R, AUC iP/R.
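For reference, the co-citation measure ω from Section 3.1 can be sketched in Python as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the denominator is assumed to count all co-citations of A before singleton pairs are dropped, and normalizing the diffused estimate by the sum of the weights is likewise an assumption.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cocitation_weights(reference_lists, min_count=2):
    """reference_lists: iterable of citation-feature collections, one per citing document.

    Returns {A: {B: omega(A, B)}}, where omega(A, B) is the number of times A is
    co-cited with B divided by the total number of co-citations involving A.
    Pairs seen fewer than `min_count` times are dropped, mirroring the removal of
    co-citations that occurred only once.
    """
    pair_counts = Counter()
    for refs in reference_lists:
        for a, b in permutations(set(refs), 2):
            pair_counts[(a, b)] += 1
    totals = Counter()
    for (a, _b), n in pair_counts.items():
        totals[a] += n          # assumption: totals include singleton pairs
    omega = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            omega[a][b] = n / totals[a]
    return dict(omega)


def diffuse_p_ppi(feature, omega, known_p_ppi):
    """Estimate P(PPI | feature) for an unseen feature from its co-cited neighbors.

    Weighted average of the neighbors' known values, with omega as weights.
    Returns None when no neighbor carries class information.
    """
    neighbors = {b: w for b, w in omega.get(feature, {}).items() if b in known_p_ppi}
    weight_sum = sum(neighbors.values())
    if weight_sum == 0:
        return None
    return sum(w * known_p_ppi[b] for b, w in neighbors.items()) / weight_sum


# Invented example: X is always cited together with Y, but Y is also cited with Z.
toy_refs = [["X", "Y"], ["X", "Y"], ["Y", "Z"], ["Y", "Z"], ["Y", "Z"], ["Y", "Z"]]
toy_omega = cocitation_weights(toy_refs)
print(toy_omega["X"]["Y"], toy_omega["Y"]["X"])
```

In the toy data the weight from X to Y is larger than the weight from Y to X, which is the asymmetry that makes the co-citation network a directed graph.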
We submitted 5 runs to the offline challenge:

1) Naive Bayes classifier using the top citation features.
2) Same as (1) but where citation features are supplemented with the co-citation weight ω.
3) Same as (2) but with a different number of top citation features.
4) Same as (1) but in combination with VTT as described above, using a VTT confidence cutoff parameter.
5) Same as (2) but in combination with VTT as described above, using a VTT confidence cutoff parameter.

The parameter-sets for these runs are listed in Table 9. Table 10 shows the official performance for these five runs submitted to the offline challenge.

Fig. 8. F1 and P at Full R performance of offline CNC runs in comparison with the other top performing submission (group 20). Also shown as an orange rectangle is the 95% confidence interval of the mean for all submissions to the ACT of the BC II.5 challenge, for these two performance measures. The black cross denotes the mean value, and the gray star the median. Blue diamonds represent the official VTT online submissions, and the red squares represent the same runs after fixing the Unicode and ABNER cache errors. Blue circles represent the CNC runs; we can see that Runs 4 and 5 are clearly at the top according to the P at Full R performance measure. The green triangle represents the other top performer in this challenge.

TABLE 11
CNC scores after algorithm corrections (same rows and columns as Table 10).

The performance of the offline CNC runs was lower than what we obtained for VTT in the online part of the challenge. Nonetheless, for most performance measurements, these runs were still above the mean value for all submissions to the BC II.5 challenge; all of the F1 and most of the MCC measurements were above the median value, and all measurements of Accuracy were above the 95% confidence interval of the mean. Runs 4 and 5, which combined CNC with VTT, led to measurements of AUC, Accuracy, MCC, and F1 above the 95% confidence interval of the mean, though still below the online submissions with VTT alone. Interestingly, these runs also led to the top two measurements of Precision at Full Recall (P at Full R) for the entire challenge, both well above the 99% confidence interval of the mean of all submissions. While the P at Full R measure is not a measure of overall good performance for binary classification, this result shows that integrating CNC with VTT reduces the rate of misclassification when full recall (retrieval of every relevant document) must be guaranteed. Figure 8 depicts in graphical form the performance of all our submissions for the F1 and P at Full R measures.
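Runs 4 and 5 rely on the VTT/CNC integration rule from Section 3.1. A minimal Python sketch of that rule is given below under several loudly stated assumptions: labels are encoded as +1 (PPI) and -1 (non-PPI), the divisor of the combined confidence is taken to be 2, ties in confidence go to VTT, and the default cutoff of 0.35 is the post-challenge value quoted later in this section rather than a confirmed setting of the official runs. The function name and signature are hypothetical.

```python
def combine_vtt_cnc(vtt_label, vtt_conf, vtt_distance, cnc_label, cnc_conf,
                    cutoff=0.35, divisor=2.0):
    """Return (label, confidence) for one document.

    vtt_distance: distance of the document to the VTT decision surface.
    cutoff, divisor: assumed values, see the lead-in above.
    """
    # Far from the VTT decision surface: trust VTT alone.
    if abs(vtt_distance) > cutoff:
        return vtt_label, vtt_conf
    # Otherwise the more confident classifier decides the label (ties go to VTT).
    label = vtt_label if vtt_conf >= cnc_conf else cnc_label
    if vtt_label == cnc_label:
        confidence = (vtt_conf + cnc_conf) / divisor      # agreement: sum
    else:
        confidence = abs(vtt_conf - cnc_conf) / divisor   # disagreement: difference
    return label, confidence


# Invented inputs: VTT is close to its decision surface and CNC is more confident.
print(combine_vtt_cnc(+1, 0.2, 0.1, -1, 0.6))
```

Under this reading, disagreement between the two classifiers lowers the combined confidence, so such documents move toward the middle of the ranked list.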
Unfortunately, after the challenge we discovered several issues that affected the performance of our CNC submissions in the offline ACT challenge. First, some improperly parsed data needed to be removed from the citation network database. More importantly, the classifier's AUC scores were diminished because the original CNC confidence score was not properly normalized; the mutual-information-based confidence score calculation was only corrected post-challenge. In addition, two parameters were added in order to increase the speed of the co-citation algorithm and decrease the spread of spurious correlations: for features lacking class distributions, one parameter limited potential co-citation neighbors to only a given number of top trained features (as ranked by Chi-square score), while the other parameter limited co-citation links to cases where ω was above a certain threshold. The settings of these parameters (800 top features and an ω threshold of 0.3) were chosen by picking parameter values that maximized F1 scores when tested on the BC II.5 training set after training on the BC2 training set. Revised scores for the CNC are shown in Table 11, where we can see that the performance obtained for the four most important measures improved. Though the performance of P at Full R slightly declined, it still remained well above the performance of all other submissions to the challenge. From the difference between Run 1 and Run 2, as well as between Run 4 and Run 5, we also observe that including co-citation data reduced the number of false positives, resulting in an improvement in Accuracy and AUC. However, in terms of the rank product measure of performance (formula 4), this improvement is marginal: RP(CNC Run 5) = 14.8, RP(CNC Run 4) = 14.9, RP(CNC Run 2) = 18.7, and RP(CNC Run 1) = 20.7, where these runs ranked 13th, 14th, 18th, and 19th, respectively, out of 37 total runs submitted to the ACT of the BC II.5 challenge.
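To make the rank product comparison concrete, the following Python sketch ranks submissions on each measure and combines the ranks. Formula 4 is not reproduced in this section, so the sketch assumes the geometric-mean form of the rank product from [18]; the scores below are invented and do not correspond to any challenge submission.

```python
import math


def rank_product(ranks):
    """Geometric mean of a submission's ranks (assumed form of the rank product)."""
    return math.exp(sum(math.log(r) for r in ranks) / len(ranks))


def rank_products(scores_by_measure):
    """scores_by_measure: {measure: {submission: score}}, higher scores are better.

    Returns {submission: rank product over all measures}; rank 1 is the best.
    Ties are broken arbitrarily by sort order in this sketch.
    """
    ranks = {}
    for scores in scores_by_measure.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, name in enumerate(ordered, start=1):
            ranks.setdefault(name, []).append(position)
    return {name: rank_product(r) for name, r in ranks.items()}


# Invented scores for three hypothetical submissions on two measures.
toy_scores = {
    "AUC": {"a": 0.9, "b": 0.8, "c": 0.7},
    "F1": {"a": 0.5, "b": 0.7, "c": 0.6},
}
toy_rp = rank_products(toy_scores)
print(min(toy_rp, key=toy_rp.get))
```

A lower rank product indicates a submission that ranks well across all measures at once, which is why it is used here to identify the most well-rounded runs.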
Interestingly, even with the post-challenge changes, combining CNC with the VTT algorithm using a VTT confidence cutoff parameter of 0.35 improved CNC performance but could not outperform VTT by itself. This was the case even in trials when CNC was mixed with VTT scores at a very low confidence level (not shown).

4 DISCUSSION AND CONCLUSION

From our previous work [5], we knew that the lightweight VTT method performed well in the classification of PPI-relevant abstracts. Given our results in the ACT of the BC II.5 challenge, we can now conclude that it also performs very well in a full-text scenario. Indeed, the VTT classifier, when corrected for the minor errors discussed in Section 2.4, was able to outperform every other submission to this challenge according to the rank product of the four main performance measures (Table 7). Even when considering the official VTT submissions (with Unicode and ABNER cache errors), the best VTT run was the second-best submission of the entire challenge according to the same measure (Table 6; see Section 2.4 for details). Interestingly, VTT uses only a small number of words extracted from the text (1000), minimal entity recognition (protein mentions via the off-the-shelf ABNER [19]), and a linear decision surface. Yet, this method was very competitive against more sophisticated systems in both the BioCreative 2 [5] and BioCreative II.5 challenges.

Fig. 9. VTT decision surface for the best four of five VTT submissions (after correction of Unicode and ABNER cache errors). The horizontal axis corresponds to the value of P(d)/N(d) and the vertical axis corresponds to the value of np(d), for each document. Black pluses represent positive documents, red circles represent negative documents.

Perhaps the key to the success of this lightweight method in this challenge is the real-world nature of the BioCreative data sets. Because the testing and training data are obtained in realistic annotation and publication scenarios, rather than sampled from prepared corpora with statistically identical feature distributions, more sophisticated machine learning classifiers tend to overfit the training data and generalize the concept of protein interaction from the bibliome less well. The drift between training and testing data was a real issue in BC2 [5], and we have evidence that the same may have occurred in the BC II.5 challenge. We trained a classical classifier to distinguish between the training and testing corpora.
Specifically, we used 4-fold cross-validation to train on subsets of articles from the BC II.5 training and testing sets, now labeled according to membership in the training or testing sets rather than PPI-relevance or irrelevance. Classifier features were selected, after Porter stemming and stopword removal, as the top 1000 single words ranked according to their information-gain score [29]. Document vectors, with those same information-gain scores as term weights, were used to train a Support Vector Machine (SVM) classifier (we used the SVM-light package [30] with a linear kernel and default parameters). According to the F-Score and AUC measures, the two corpora can be classified and are therefore sufficiently distinct, exhibiting a significant amount of drift. When we used only PPI-relevant articles from the training and testing data, the SVM classifier obtained F1 = 0.63 and AUC = 0.76. When we used only PPI-irrelevant articles, the SVM classifier obtained F1 = 0.54 and AUC = 0.78. When we considered both PPI-relevant and irrelevant articles, the SVM classifier obtained F1 = 0.63 and AUC = 0.79. All scores were averaged over eight 4-fold runs. If the training and testing data were indistinguishable (drawn from the same statistical distribution), AUC and F-Score would be near 0.5. Clearly, this is not the case with this data, nor should it be expected from the real-world scenario of BC II.5. We also see that drift occurs for both PPI-relevant and irrelevant articles.

Figures 4 and 5 show how the positive and negative documents in the training data, using our word-pair features, can be easily separated by a linear surface. If we were to use a more sophisticated decision surface, it is quite possible that it would yield much better class separation on the training data. Indeed, we already observed in BC2 that SVM and Singular Value Decomposition classifiers obtained higher performance in the training data than VTT (as measured by accuracy and F-Score), but lower in the testing data [5].
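The drift test described earlier in this section rests on the fact that a classifier asked to separate two samples from the same distribution should score an AUC near 0.5. The Python sketch below computes AUC directly from classifier scores using its rank-based definition; it is not the SVM-light pipeline used in the paper, and the scores are invented. Here "positives" would be articles from the testing set and "negatives" articles from the training set.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive scores above a random negative.

    Ties count as one half, which is the usual rank-based definition of the
    area under the ROC curve.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Invented scores: perfectly separable corpora versus indistinguishable ones.
separable = auc([0.9, 0.8], [0.1, 0.2])
indistinguishable = auc([0.1, 0.9], [0.1, 0.9])
print(separable, indistinguishable)
```

Values such as the 0.76 to 0.79 reported above sit well away from the 0.5 obtained for indistinguishable samples, which is the basis for concluding that the two corpora drifted.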
Since VTT had already been compared to traditional classifiers such as SVM [5], in this challenge we did not submit runs with those kinds of classifiers and instead chose to test more parameters of the VTT and the novel CNC. Therefore, to decide if algorithms submitted to the challenge with more sophisticated decision surfaces suffered from the drift between training and testing, we would need access to their performance on the training data, not just the available results on testing data. Given the overall performance of VTT, we can at least say that this method was highly competitive in dealing with the measurable drift between training and testing data. Figure 9 depicts the decision surfaces of the VTT method for four (corrected) submissions on the final test data. While better surfaces clearly exist to classify the test data, the linear surface of the VTT method avoided overfitting, and was very good at generalizing the concept of protein-protein interaction from the bibliome in the not fully statistically ideal, real-world scenario of BC II.5, while remaining computationally lightweight.

We also conclude that training with additional data from MIPS, which contains articles from various publication sources rather than a single journal, was not very advantageous. This seems to argue against the ability of the VTT method to generalize the real-world concept of protein-protein interaction. However, the "real world" in this task is the scenario of FEBS Letters curators attempting to identify PPI-relevant documents among the articles submitted to this journal: all systems were ultimately only tested on the FEBS Letters test set, and not in determining PPI-relevance at large.

As for using features extracted using entity recognition, we can say that counting protein mentions via ABNER in abstracts and figure captions was moderately advantageous (though not using it led to a higher AUC score). We also observed during training that using other entities from ABNER and relevant ontologies (see Section 2.1) was not advantageous. Therefore, while using ABNER protein counts did not lead to a large improvement in classification, it was the only entity we were able to identify which led to a moderate improvement in classification using the VTT method.

The performance of the newly-introduced CNC algorithm in the ACT task was not competitive with the best content-based classifiers, but was still above average and provides a proof-of-concept demonstration of the applicability of the citation-network method to the biomedical document classification domain. Our implementation points to several approaches that could be investigated in the search for high-performance citation-network based classification.
First, we did not use counts of how many times each reference was cited in a document, though use of such weighted features could indicate the citations that are most informative about a given article's class label. Additionally, including the title of the citing document section in the citation features could lead to better performance. Different sections may reference articles for different reasons; citations from the Methodology section, for example, may be particularly useful in identifying documents relevant to a specific biomedical subfield, as in the ACT task. Finally, another way to capture citation styles relevant to domain-specific classification would involve combining citation features with statistically-significant tokens from citing sentences, which are known as citances and have already received some attention in the biomedical text-mining field [31].

Performance of the CNC depends not only on the algorithm and training data, but also on the underlying citation database from which ω weights are computed. We observed (see Section 3.2) that including co-citation data reduced the number of false positives, but ultimately led to a marginal overall performance improvement. The citation network used in our work, however, is extremely limited in coverage and subject to parsing errors. An accessible, high-quality repository of biomedical citation data would go a long way towards advancing citation-network based classifiers in the field. Indeed, literature domains where such repositories exist, such as the publicly available US patents database, have seen wider application of co-citation-based algorithms (see, for example, [32], [33]).

In summary, we have shown that our VTT classifier, previously applied to abstracts only, is also very competitive in the classification of PPI-relevant documents in a real-world, full-text scenario such as the one provided by BC II.5.
Moreover, the novel CNC is the first application of a citation-based classifier to the PPI domain and is thus a promising new avenue for further investigation in bibliome informatics.

AUTHORS' CONTRIBUTIONS

Artemy Kolchinsky developed and implemented the CNC method, helped set up the online server, participated in various experimental and validation computations, and helped write the manuscript. Alaa Abi-Haidar helped develop the VTT method, produced the code necessary for pre-processing abstracts and computing training data partitions, participated in various experimental and validation computations, and helped with producing figures for the manuscript. Jasleen Kaur helped set up the online server as well as with data preprocessing. Ahmed Abdeen Hamed conducted feature extraction experiments from various ontologies. Luis M. Rocha was responsible for integrating the team and designing the experimental setup, writing the manuscript, as well as developing the VTT method.

ACKNOWLEDGMENTS

We are very thankful to the editors and reviewers of this article for the very detailed and useful reviews provided. We would like to acknowledge the help of Predrag Radivojac and Nils Schimmelmann, who provided the additional MIPS data used by our team. We would also like to thank the FLAD Computational Biology Collaboratorium at the Gulbenkian Institute in Oeiras, Portugal, for hosting and providing facilities used to conduct part of this research.

REFERENCES

[1] L. Hunter and K. Cohen, "Biomedical language processing: What's beyond PubMed?" Molecular Cell, vol. 21, no. 5.
[2] PubMed. [Online].
[3] H. Shatkay and R. Feldman, "Mining the biomedical literature in the genomic era: An overview," Journal of Computational Biology, vol. 10, no. 6.
[4] L. J. Jensen, J. Saric, and P. Bork, "Literature mining for the biologist: from information retrieval to biological discovery," Nat Rev Genet, vol. 7, no. 2, Feb.

[5] A. Abi-Haidar, J. Kaur, A. Maguitman, P. Radivojac, A. Retchsteiner, K. Verspoor, Z. Wang, and L. M. Rocha, "Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks," Genome Biology, 9(Suppl 2):S11.
[6] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia, "Overview of BioCreative: critical assessment of information extraction for biology," BMC Bioinformatics, vol. 6 Suppl 1, p. S1.
[7] Proceedings of the Second BioCreative Challenge Evaluation Workshop.
[8] S. Chakrabarti, Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann.
[9] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos, "An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal messages," Annual ACM Conference on Research and Development in Information Retrieval.
[10] T. Joachims, Learning to classify text using support vector machines: methods, theory, and algorithms. Kluwer Academic Publishers.
[11] R. Feldman and J. Sanger, The Text Mining Handbook: advanced approaches in analyzing unstructured data. Cambridge: Cambridge University Press.
[12] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys (CSUR), vol. 34, no. 1.
[13] M. Krallinger and A. Valencia, "Evaluating the detection and ranking of protein interaction relevant articles: the BioCreative challenge interaction article sub-task (IAS)," in Proceedings of the Second BioCreative Challenge Evaluation Workshop.
[14] H. W. Mewes, C. Amid, R. Arnold, D. Frishman, U. Guldener, G. Mannhaupt, M. Munsterkotter, P. Pagel, N. Strack, V. Stumpflen, J. Warfsmann, and A. Ruepp, "MIPS: analysis and annotation of proteins from whole genomes," Nucleic Acids Res, vol. 32, no. Database issue, pp. D41-D44, Jan.
[15] F. Fdez-Riverola, E. Iglesias, F. Diaz, J. Mendez, and J. Corchado, "Spamhunting: An instance-based reasoning system for spam labelling and filtering," Decision Support Systems, in press.
[16] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5.
[17] M. Porter, "An algorithm for suffix stripping," Program, vol. 13, no. 3.
[18] R. Breitling, P. Armengaud, A. Amtmann, and P. Herzyk, "Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments," FEBS Letters, vol. 573, no. 1-3, August.
[19] B. Settles, "ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text," Bioinformatics, vol. 21, no. 14.
[20] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. New York: ACM Press, Addison-Wesley.
[21] P. Baldi, "Assessing the accuracy of prediction algorithms for classification: an overview," Bioinformatics, vol. 16, no. 5, May.
[22] L. E. Dodd and M. S. Pepe, "Partial AUC estimation and regression," Biometrics, vol. 59, no. 3.
[23] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8.
[24] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, no. 2, October.
[25] T. Cover and J. Thomas, Elements of Information Theory. John Wiley & Sons, New York.
[26] I. Councill, C. Giles, and M. Kan, "ParsCit: An open-source CRF reference string parsing package," in Proceedings of LREC.
[27] U. Laemmli et al., "Cleavage of structural proteins during the assembly of the head of bacteriophage T4," Nature, vol. 227, no. 5259.
[28] D. Perkins, D. Pappin, D. Creasy, J. Cottrell et al., "Probability-based protein identification by searching sequence databases using mass spectrometry data," Electrophoresis, vol. 20, no. 18.
[29] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," Proceedings of the Fourteenth International Conference on Machine Learning.
[30] T. Joachims, "Making large-scale support vector machine learning practical," Advances in kernel methods: support vector learning.
[31] P. Nakov, A. Schwartz, and M. Hearst, "Citances: Citation sentences for semantic analysis of bioscience text," in Proceedings of the SIGIR04 workshop on Search and Discovery in Bioinformatics.
[32] K. Lai and S. Wu, "Using the patent co-citation approach to establish a new patent classification system," Information Processing and Management, vol. 41, no. 2.
[33] X. Li, H. Chen, Z. Zhang, and J. Li, "Automatic patent classification using citation network information: an experimental study in nanotechnology," in Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries. ACM, 2007.

Artemy Kolchinsky is pursuing a PhD in the complex systems track of the School of Informatics and Computing, Indiana University Bloomington. He is also a visiting graduate student at the FLAD Computational Biology Collaboratorium at the Instituto Gulbenkian de Ciencia, Portugal.

Alaa Abi-Haidar received an MS in Computer Science from Indiana University. He is currently a PhD candidate in the School of Informatics and Computing at Indiana University. His current research interests include text mining, classification, bio-inspired computing and artificial immune systems.

Jasleen Kaur received an MS in Bioinformatics from Indiana University Bloomington. She is currently pursuing a PhD in Informatics in the complex systems track of the School of Informatics and Computing, Indiana University Bloomington. Her research interests include text mining, literature mining, bioinformatics and social networks mining.

Ahmed Abdeen Hamed has an MS in Computer Science from Indiana University and is a part-time PhD student in Computer Science at the University of Vermont. His research interests are text mining, web mining, and scientific workflows. He is concerned with ecosystems monitoring and developing scientific workflows that can produce alerts for conservationists and decision makers. He currently holds a professional position with the Marine Biological Laboratory, Woods Hole, Massachusetts.

Luis M. Rocha is an Associate Professor at the School of Informatics and Computing at Indiana University, Bloomington, where he has directed the PhD program on Complex Systems and is also a member of the Center for Complex Networks and Systems and core faculty of the Cognitive Science Program. He is also the director of the FLAD Computational Biology Collaboratorium and part of the direction of the associated PhD program in Computational Biology at the Instituto Gulbenkian de Ciencia, Portugal, where the central goal is interdisciplinary research involving the life sciences. His research is on complex systems, computational biology, artificial life, embodied cognition and bio-inspired computing. He received his PhD in Systems Science in 1997 from the State University of New York at Binghamton.


OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

On-Line Data Analytics

On-Line Data Analytics International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob

More information

GCSE Mathematics B (Linear) Mark Scheme for November Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education

GCSE Mathematics B (Linear) Mark Scheme for November Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education GCSE Mathematics B (Linear) Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education Mark Scheme for November 2014 Oxford Cambridge and RSA Examinations OCR (Oxford Cambridge

More information
