
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. ?, NO. ?, 2010

Classification of protein-protein interaction full-text documents using text and citation network features

Artemy Kolchinsky 1,2, Alaa Abi-Haidar 1,2, Jasleen Kaur 1, Ahmed Abdeen Hamed 1 and Luis M. Rocha 1,2

Abstract: We participated (as Team 9) in the Article Classification Task of the Biocreative II.5 Challenge: binary classification of full-text documents relevant for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: (1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier we successfully introduced in BioCreative 2 for binary classification of abstracts, and (2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top performing submission in this task, taking into account the rank product of the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews Correlation Coefficient performance measures. The novel citation network classifier for the biomedical text mining domain, while not a top performing classifier in the challenge, performed above the central tendency of all submissions and therefore indicates a promising new avenue to investigate further in bibliome informatics.

Index Terms: Text Mining, Literature Mining, Binary Classification, Protein-Protein Interaction, Citation Network.

1 BACKGROUND AND DATA

BIOMEDICAL research is increasingly dependent on the automatic analysis of databases and literature to determine correlations and interactions amongst biochemical entities, functional roles, phenotypic traits and disease states. The biomedical literature is a large subset of all data available for such inferences.
Indeed, the last decade has witnessed an exponential growth of metabolic, genomic and proteomic documents (articles) being published [1]. PubMed [2] encompasses a growing collection of more than 18 million biomedical articles describing all aspects of our collective knowledge about the bio-chemical and functional roles of genes and proteins in organisms. Biomedical literature mining is a field devoted to integrating the knowledge currently distributed in the literature and a large collection of domain-specific databases [3], [4]. It helps us tap into the biomedical collective knowledge (the bibliome), and uncover new relationships and interactions induced from global information but unreported in individual experiments [5]. The BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge evaluation is an effort to enable comparison of various approaches to literature mining. Its greatest value, perhaps, is that it consists of a community-wide effort, leading many different groups to test their methods against a common set of specific tasks, thus resulting in important benchmarks for future research [6], [7].

Author affiliations: 1 School of Informatics, Indiana University, USA. 2 FLAD Computational Biology Collaboratorium, Instituto Gulbenkian de Ciência, Portugal. Corresponding author: rocha@indiana.edu

In most literature or text mining projects in biomedicine, one needs first to collect a set of relevant documents for a given topic of interest such as protein-protein interaction. But manually classifying articles as relevant or irrelevant to a given topic of interest is very time consuming and inefficient for curation of newly published articles [4] and subsequent analysis and integration. The problem of automatic binary classification of documents has been explored in several domains such as Web Mining [8], Spam Filtering [9] and Document Classification in general [10], [11]. The machine learning field has offered many solutions to this problem [12], [11], including methods devoted to the biomedical domain [4].
However, in contrast to performance in well-prepared theoretical scenarios, even the most sophisticated solutions tend to underperform in more realistic situations such as the BioCreative challenge (for example, by over-fitting in the presence of drift between testing and training data). We participated (as Team 9) in the online and offline parts of the Article Classification Task (ACT) of the BioCreative II.5 Challenge, which consisted of the binary classification of full-text documents as relevant or nonrelevant to the topic of protein-protein interaction (PPI). In most text-mining projects in biomedicine, one needs first to collect a set of relevant documents, typically from information in abstracts. To advance the capability of the community on this essential selection step, binary classification of abstracts was the focus of one of the tasks of the previous Biocreative classification challenge [13]. For this challenge, the objective was instead to classify full-text documents, which allowed us to evaluate the possible additional value of full-text information in this selection problem. The ACT subtask in BioCreative II.5, in particular, aimed to evaluate classification performance between relevant and irrelevant documents to PPI.

Naturally, tools developed for ACT have great potential to be applied in many other literature mining contexts. For that reason, we used two very general classifiers which could easily be applied to other domains and ported to different computer infrastructure: (1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier we successfully introduced in the abstract classification task of BioCreative 2 (BC2) [5], and (2) a Naive Bayes classifier using features extracted from the citation network of the relevant literature. We participated in the online submission with our own annotation server implementing the VTT algorithm via the BioCreative MetaServer platform. The Citation Network Classifier (CNC) runs were submitted via the offline component of the Challenge. We should note that VTT does not require the use of specific databases or ontologies, and so can be ported easily and applied to other domains. In addition, since full-text data contains a wealth of citation information, we developed and tested the novel CNC on its own and integrated with VTT.

We were given 61 PPI-relevant and 558 PPI-irrelevant full-text training documents. We supplemented this data by collecting additional full-text documents appropriately classified in the previous BC2 training data [13] as well as in the MIPS database [14]. For VTT training purposes, we created two datasets: the first contained exactly 4x558 = 2232 documents, where the PPI-relevant set is comprised of 558 documents from BC2 plus 558 oversampled instances of the 61 relevant documents from this challenge.
The PPI-irrelevant set is comprised of 558 documents from BC2 and the 558 irrelevant documents provided with this challenge. The second training set contains 370 PPI-relevant documents extracted from MIPS and 370 randomly sampled irrelevant documents from BC2.

2 VARIABLE TRIGONOMETRIC THRESHOLD CLASSIFICATION

2.1 Word-pair and Entity Features

Since classification had to be performed in real time for the online part of this challenge, we used the lightweight VTT method we previously developed [5] for Biocreative 2. This method, loosely inspired by the spam filtering system SpamHunting [15], is based on computing a linear decision surface (details below) from the probabilities of word-pair features being associated with relevant and irrelevant documents in the training data [5]. A reason for the lightweight nature of VTT is that such word-pair features can be computed from a relatively small number of words. We used only the top 1000 words W obtained from the product of the ranks of the TF.IDF measure [16] averaged over all documents per word w, and a score $S(w) = |p_{TP}(w) - p_{TN}(w)|$ that measures the difference between the probabilities of occurrence in relevant ($p_{TP}(w)$) and irrelevant ($p_{TN}(w)$) training set documents (after removal of stop words (footnote 1) and Porter stemming [17]).

Fig. 1. The 1000 SP features with largest $S(w_i, w_j)$ on the $p_{TP}/p_{TN}$ plane; font size is proportional to the value of $S(w_i, w_j)$.

All incoming full-text documents were converted into ordered lists of these 1000 words, $w \in W$, in the sequence of occurrence in the text. The simplified vector representation and pre-processing of incoming full-text documents makes this method lightweight and appropriate for the online part of this challenge. Words with the highest score S tend to be associated with either positive or negative abstracts and are assumed to be good features for classification.
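The word score described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that $p_{TP}(w)$ and $p_{TN}(w)$ are estimated as the fraction of relevant and irrelevant training documents containing w, and the function names and toy documents are hypothetical. The TF.IDF rank that the paper combines with S is left out.

```python
from collections import Counter


def doc_frequency(docs):
    """Fraction of documents in which each word occurs at least once."""
    counts = Counter()
    for words in docs:
        counts.update(set(words))
    n = len(docs)
    return {w: c / n for w, c in counts.items()}


def s_scores(relevant_docs, irrelevant_docs):
    """S(w) = |p_TP(w) - p_TN(w)| for every word seen in either class.

    Assumption: the probabilities are estimated as document frequencies.
    """
    p_tp = doc_frequency(relevant_docs)
    p_tn = doc_frequency(irrelevant_docs)
    vocab = set(p_tp) | set(p_tn)
    return {w: abs(p_tp.get(w, 0.0) - p_tn.get(w, 0.0)) for w in vocab}


def to_ordered_list(words, vocabulary):
    """Keep only words in the selected set W, preserving order of occurrence."""
    return [w for w in words if w in vocabulary]


# Toy, already stemmed documents (hypothetical data).
relevant = [["interact", "yeast", "cell"], ["interact", "bind"]]
irrelevant = [["cell", "tumor"], ["cell", "gene"]]

scores = s_scores(relevant, irrelevant)
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
ordered = to_ordered_list(["the", "interact", "x", "cell"], {"interact", "cell"})
```

Here "interact" occurs in every relevant document and in no irrelevant one, so it receives the maximal score of 1, while "cell" occurs in both classes and scores lower.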
Since in this challenge we were dealing with full-text documents, rather than abstracts as in the previous BC2 challenge, in addition to the S score we also used the TF.IDF rank to select the best word features. Specifically, we used the rank product [18] of TF.IDF with the S score, which resulted in better (k-fold) classification of the training data than using either score alone. The top 15 words were: immunoprecipit, 2gpi, lysat, transfect, interact, domain, plasmid, vector, mutant, fusion, bead, antibodi, pacrg, two-hybrid, yeast.

1. The list of stopwords removed: i, a, about, an, are, as, at, be, by, for, from, how, in, is, it, of, on, or, that, the, this, to, was, what, when, where, who, will, the, and, we, were. Notice that the words "with" and "between" were kept.

From word set W, we computed short-window (SP) and long-window (LP) word-pair features $(w_i, w_j)$. SP refers to word-pair features comprised of adjacent words in the ordered lists that represent documents (footnote 2); the order in which words occur is preserved, therefore $(w_i, w_j) \neq (w_j, w_i)$. LP features refer to word pairs composed of words that occur within 10 words of one another in the ordered lists; in this case, the order in which words occur is not important, therefore $(w_i, w_j) = (w_j, w_i)$. We also computed the probability that such word-pairs appear in a positive or negative document: $p_{TP}(w_i, w_j)$ and $p_{TN}(w_i, w_j)$, respectively. Figure 1 depicts the 1000 SP features with largest $S(w_i, w_j) = |p_{TP}(w_i, w_j) - p_{TN}(w_i, w_j)|$ plotted on a plane where the horizontal axis is the value of the probability of occurrence in a relevant document, $p_{TP}(w_i, w_j)$, and the vertical axis is the value of the probability of occurrence in an irrelevant document, $p_{TN}(w_i, w_j)$; we refer to this as the $p_{TP}/p_{TN}$ plane. Table 1 lists the top 10 SP and LP word pairs for score $S(w_i, w_j)$.

TABLE 1. Top 10 SP and LP word-pair features ranked by S score. Columns in the original: $(w_i, w_j)$, $p_{TP}$, $p_{TN}$, S; the numeric values are missing from this copy.

Rank   SP (w_i, w_j)        LP (w_i, w_j)
1      interact,between     interact,interact
2      protein,interact     us,interact
3      fusion,protein       between,interact
4      cell,transfect       shown,interact
5      interact,protein     interact,bind
6      accept,?             suggest,interact
7      cell,lysat           protein,immunoprecipit
8      bind,domain          interact,protein
9      transfect,cell       assai,interact
10     yeast,two-hybrid     domain,interact

In our previous application of this method in the BC2 challenge [5], we used as an additional feature the number of proteins mentioned in abstracts, as identified by an entity recognition algorithm such as ABNER [19]. However, since in this challenge we were dealing with full-text documents, it was not clear if such relevant entity counts would help the classifier's performance as much as they did when classifying abstracts in BC2, especially since ABNER itself is trained only on abstracts.
Therefore, we focused on counting entity occurrences in specific portions of documents such as the abstract, the body, figure captions, table captions, as well as combinations of these. In addition to protein mentions recognized by ABNER, we tested many other entities identified by ABNER and an ontology-based annotator (which matched terms in text to PPI terms extracted from the Gene Ontology, the Protein-Protein Interaction Ontology, the Protein Ontology, and the Disease Ontology). Since the additional ABNER and ontology-based features did not lead to the identification of entity features that seemed to distinguish PPI-relevant from irrelevant documents (as discussed below), we do not describe the process of extracting such features here.

2. Notice that the ordered lists representing documents contain only words in set W.

Fig. 2. Comparison of the counts of protein mentions as identified by ABNER in distinct passages of documents in the training data. Top figure depicts the counts of ABNER protein mentions in the body section, whereas bottom figure depicts the counts of ABNER protein mentions in figure captions and abstracts. In these figures, the horizontal axis represents the number of mentions x, and the vertical axis the probability p(x) of documents with at least x mentions. The blue circles denote documents labeled relevant, while the red squares denote documents labeled irrelevant; the green triangles denote the difference between blue and red lines.

The only entity feature that proved useful in discriminating relevant and irrelevant documents in the training data was the count of protein mentions in abstracts and figure captions as recognized by ABNER. Figure 2 depicts a comparison of the counts of ABNER protein mentions in two specific portions of all documents of the Biocreative II.5 training data: the body, and the abstract plus figure captions. As can be seen, the counts of protein mentions in the body of the full-text documents in the training data do not discriminate between relevant and irrelevant documents. In contrast, the same counts restricted to abstracts and figure caption passages are different for relevant and irrelevant documents. We used this type of plot to identify which features and which document portions behaved differently for relevant and irrelevant documents; only the counts of ABNER protein mentions in abstracts and figure captions were sufficiently distinct between the two classes. Based on observations of plots such as those depicted in Figure 2, we decided not to test these additional features on training data. It is not possible for us to identify exactly why the entity count features we tested failed to discriminate between documents labeled relevant and irrelevant in the training data. Because we had no access to annotations of protein mentions on the full-text corpus, we cannot compute the failure rates of the entity recognition tools we used (i.e. ABNER).

2.2 Methods

The ideal word-pair features in the $p_{TP}/p_{TN}$ plane are those closest to either one of the axes.
Any feature w is a vector on this plane (see figure 3), therefore feature relevance to each of the classes can be measured with the traditional trigonometric measures of the angle ($\alpha$) between this vector and the $p_{TP}$ axis: $\cos(\alpha)$ is a measure of how strongly features are associated with positive/relevant documents, and $\sin(\alpha)$ with negative/irrelevant ones in the training data. Then, for every document d, we compute the sum of all feature contributions for a positive (P) and negative (N) decision:

$$P(d) = \sum_{w \in d} \cos(\alpha(w)) = \sum_{w \in d} \frac{p_{TP}(w)}{\sqrt{p_{TP}^2(w) + p_{TN}^2(w)}}, \qquad N(d) = \sum_{w \in d} \sin(\alpha(w)) = \sum_{w \in d} \frac{p_{TN}(w)}{\sqrt{p_{TP}^2(w) + p_{TN}^2(w)}} \tag{1}$$

The decision of whether document d is a member of the PPI-relevant (TP) or irrelevant (TN) set of documents is then computed as:

$$d \in \begin{cases} TP, & \text{if } \dfrac{P(d)}{N(d)} \geq \lambda_0 + \dfrac{\beta - \sum_k n_k(d)}{\beta} \\ TN, & \text{otherwise} \end{cases} \tag{2}$$

Fig. 3. Trigonometric measures of term relevance in the $p_{TP}/p_{TN}$ plane; $p_{TP}$ and $p_{TN}$ computed from labeled documents in training data.

where $\lambda_0$ is a constant threshold for deciding whether a document is positive/relevant or negative/irrelevant. This threshold is subsequently adjusted for each document with the factor $(\beta - \sum_k n_k(d))/\beta$, where $\beta$ is another constant, and $\sum_k n_k(d)$ is a series of counts of topic-relevant entities in document d. As discussed above, the only entity that proved useful in discriminating between relevant and irrelevant documents in the training data of the BC II.5 challenge was the count of protein mentions in the abstracts and figure captions of documents as recognized by ABNER. Therefore, in this case $\sum_k n_k(d)$ becomes simply $np(d)$, which is the number of protein mentions in the abstract and figure captions of d. In formula 2, the classification threshold linearly decreases as $\sum_k n_k(d)$ increases. The assumption is that the more relevant entities are recognized in a document, the higher the chances that the document is relevant.
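Formulas 1 and 2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, features unseen in training are skipped, the ratio P(d)/N(d) is treated as infinite when N(d) = 0 and P(d) > 0, each LP pair is counted once per document, and the probabilities and the value of $\lambda_0$ below are invented for the example. The SP and LP pair extraction from Section 2.1 is included so that the block is self-contained.

```python
import math


def sp_pairs(words):
    """Short-window (SP) features: adjacent words, order preserved."""
    return list(zip(words, words[1:]))


def lp_pairs(words, window=10):
    """Long-window (LP) features: unordered pairs at most `window` words apart."""
    found = set()
    for i, w in enumerate(words):
        for v in words[i + 1 : i + 1 + window]:
            found.add(tuple(sorted((w, v))))
    return found


def vtt_sums(features, p_tp, p_tn):
    """Return (P, N): summed cos(alpha) and sin(alpha) over known features."""
    P = N = 0.0
    for f in features:
        tp, tn = p_tp.get(f, 0.0), p_tn.get(f, 0.0)
        norm = math.hypot(tp, tn)
        if norm == 0.0:  # assumption: features unseen in training are ignored
            continue
        P += tp / norm
        N += tn / norm
    return P, N


def vtt_threshold(lambda0, beta, n_entities):
    """Threshold of formula 2; it decreases linearly with the entity count."""
    return lambda0 + (beta - n_entities) / beta


def vtt_classify(features, p_tp, p_tn, lambda0, beta, n_entities):
    """True if the document is classified as PPI-relevant."""
    P, N = vtt_sums(features, p_tp, p_tn)
    if N > 0:
        ratio = P / N
    else:  # assumption: no negative evidence at all
        ratio = math.inf if P > 0 else 0.0
    return ratio >= vtt_threshold(lambda0, beta, n_entities)


# Hypothetical training probabilities for two SP features.
p_tp = {("protein", "interact"): 0.6, ("cell", "line"): 0.1}
p_tn = {("protein", "interact"): 0.1, ("cell", "line"): 0.5}
feats = sp_pairs(["protein", "interact", "cell", "line"])

many_proteins = vtt_classify(feats, p_tp, p_tn, lambda0=1.0, beta=36, n_entities=36)
no_proteins = vtt_classify(feats, p_tp, p_tn, lambda0=1.0, beta=36, n_entities=0)
```

With 36 protein mentions the threshold equals $\lambda_0$ and the example document is accepted; with no protein mentions the threshold rises to $\lambda_0 + 1$ and the same document is rejected, which is the variable-threshold behavior the text describes.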
In this case, this means that the higher the number of ABNER-recognized protein mentions, the easier it is to classify a document as PPI-relevant; conversely, the lower the number of protein mentions, the easier it is to classify a document as PPI-irrelevant. When $\sum_k n_k(d) = \beta$, the threshold is simply $\lambda_0$. We refer to this classification method as Variable Trigonometric Threshold (VTT). Examples of the decision surface for training data are depicted in Figures 4 and 5, and are explained below.

A measure of confidence in the classification decision for ranking documents is naturally derived from formula 2: confidence should be proportional to the value

$$\delta(d) = \left| \frac{P(d)}{N(d)} - T(d) \right|,$$

where $T(d) = \lambda_0 + \frac{\beta - \sum_k n_k(d)}{\beta}$ is the threshold point for document d. Thus, the further away from the decision surface a document is, the higher the confidence in the decision. Therefore, $\delta(d)$ is a measure of distance from a document's ratio of feature weights ($P(d)/N(d)$) to the decision surface or threshold point for that document, $T(d)$. Since BC II.5 required a confidence value in [0, 1], we used the following measure of confidence C of the decision made for a document d:

$$C(d) = \frac{\delta(d)}{\max \delta(d)} \tag{3}$$

where $\max \delta(d)$ is the maximum value of distance $\delta$ found in the training data. If a test document $d_t$ results in a $\delta(d_t)$ that is larger than $\max \delta(d)$, $C(d_t) = 1$. In BC II.5, we ranked positive documents by decreasing value of C, followed by negative documents ranked by increasing value of C.

2.3 Training

Training of the VTT classifier consisted of exhaustively searching the parameters $\lambda_0$ and $\beta$ that define its linear surface, while doing k-fold cross-validation (k = 8) on both of the training data sets described in section 1: the first with documents from the BC2 and BC II.5 challenges, and the second with additional MIPS data. We swept the following parameter range: $\lambda_0 \in [0, 10]$ and $\beta \in [1, 100]$, in steps of $\Delta\lambda$ = ? and $\Delta\beta = 1$. For each ($\lambda_0$, $\beta$) pair, we computed the mean of the Balanced F-Score ($F_1$) and Accuracy measures for the 8 folds of each training data set (footnote 3). Given the two training data sets and two performance measures, we chose VTT parameter-sets to be those that minimized the product of ranks obtained from computing each performance measure on a specific training data set. More specifically, we computed four ranks for each classifier tested in the parameter search stage: $r_F^{T1}$ and $r_A^{T1}$ rank classifiers according to the mean value of F-Score and Accuracy in the 8 folds of the first training data set, respectively; $r_F^{T2}$ and $r_A^{T2}$ rank classifiers according to the mean value of F-Score and Accuracy in the 8 folds of the second training data set, respectively. We then ranked all classifiers tested according to the rank product of these four ranks: $R = (r_F^{T1} \, r_A^{T1} \, r_F^{T2} \, r_A^{T2})^{1/4}$ [18]. This procedure was performed for the two distinct word-pair feature sets: SP and LP. Our training strategy was based on a balanced scenario with equal numbers of positive (PPI-relevant) and negative documents.

3. $\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$ and $F_1 = \frac{2\,TP}{2\,TP + FP + FN}$, where TP, TN, FP, and FN refer to true positives, true negatives, false positives, and false negatives, respectively.

We then submitted 5 runs to the online challenge:

1) Best parameter-set for SP features, which was the top performer in the first training data set (data from BC2 and BC II.5) when using SP features.
2) Best parameter-set for LP features, which was the top performer in both training data sets when using LP features.
3) Second-best parameter set for SP features, which was the top performer in the second training data set (data from MIPS) when using SP features.
4) Best parameter set for SP features without the variable threshold computed from ABNER's entity recognition ($np(d) = \beta$), and trained only on the first training data set (no MIPS data).
5) Best parameter set for LP features without the variable threshold computed from ABNER's entity recognition ($np(d) = \beta$), and trained only on the first training data set (no MIPS data).

The VTT parameter-sets for these five runs are summarized in table 2. Figures 4 and 5 depict the VTT decision surfaces with some of the submitted parameters for the two training data sets and word-pair features.

TABLE 2. VTT parameters for online runs ("?" marks values missing from this copy; the two β values shown are those stated in the captions of Figures 4 and 5).

Run     Feature Set   Entity Feature   MIPS data   β     λ0
Run 1   SP            Y                Y           ?     ?
Run 2   LP            Y                Y           72    ?
Run 3   SP            Y                Y           36    ?
Run 4   SP            N                N           ?     ?
Run 5   LP            N                N           ?     ?

Fig. 4. VTT decision surface for $\lambda_0$ = ? and $\beta$ = 36 for the documents in 4 of the 8 folds of the first training data set, using SP feature set (parameters used in Run 3). Horizontal axis corresponds to the value of $P(d)/N(d)$ and vertical axis corresponds to the value of $np(d)$, for each document. Black (documents from BC II.5 challenge) and gray (documents from BC2 challenge) circles represent positive documents, whereas red (documents from BC II.5 challenge) and orange (documents from BC2 challenge) circles represent negative documents.

2.4 Results

During the online part of the challenge, two minor technical issues arose.
The first was an inconsistency in the Unicode decoding of online-submitted documents that caused some features not to be extracted correctly. The second was a caching problem that caused miscalculation of ABNER counts (entity feature, see 2.1) for many documents. Despite these errors, all of the submitted runs performed very well. The official scores of the five runs against the online test set are provided in table 3. After the challenge, we corrected the Unicode and ABNER cache errors and computed new performance measures for the same five classifier parameters (see table 2) (footnote 4). The corrected scores are shown in table 4. Notice that the re-submitted runs did not entail retraining the classifiers using information from the test data available after the challenge. Indeed, we used the same VTT parameters in the original and re-submitted runs (table 2), as obtained by the reproducible training algorithm described in 2.3. We present the corrected results to demonstrate the merits of the method computed without errors, especially because it is important to determine the benefits of using entity recognition via ABNER, the algorithm component which was most directly affected by the errors.

4. We used the gold standard and evaluation script provided by the competition organizers after the BC II.5 challenge; we added the calculation of Precision, Recall and Balanced F-Score.

Fig. 5. VTT decision surface for $\lambda_0$ = ? and $\beta$ = 72 for the documents in 1 of the 8 folds of the second training data set, using LP feature set (parameters used in Run 2). Horizontal axis corresponds to the value of $P(d)/N(d)$ and vertical axis corresponds to the value of $np(d)$, for each document. Black circles represent positive documents (from MIPS), whereas red circles represent negative documents (from the BC II.5 Challenge).

TABLE 3. Official VTT scores for online runs. Columns: Run 1, Run 2, Run 3, Run 4, Run 5. Rows: TP, FP, FN, TN, Specificity, Sens./Recall, Precision, F1, Accuracy, MCC, P at Full R, AUC iP/R. The numeric values are missing from this copy.

TABLE 4. VTT scores after Unicode and ABNER cache correction. Columns: Run 1, Run 2, Run 3, Run 4, Run 5 (re-submitted). Rows: TP, FP, FN, TN, Specificity, Sens./Recall, Precision, F1, Accuracy, MCC, P at Full R, AUC iP/R. The numeric values are missing from this copy.
Because there are various ways to measure misclassification (type I and II errors) given the confusion matrix of the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN), there is no perfect way to characterize the performance of binary classifiers [20]. Therefore, it is important to compute performance using various measures [21]. One reasonable way to obtain an overall ranking of performance of a binary classifier c is to combine a few standard measures via the rank product [18]:

$$RP(c) = \sqrt[k]{\prod_{m=1}^{k} r_{c,m}} \tag{4}$$

where k is the number of measures considered and $r_{c,m}$ is the rank of the performance of classifier c according to measure m. The best classifiers are then those that minimize overall RP. To provide a well-rounded assessment of performance using the rank product, well-established performance measures with distinct characteristics are needed. The Biocreative II.5 challenge evaluation relies on various measures of performance; we center our discussion on four of them: Area Under the interpolated precision and recall Curve (AUC), Accuracy, Balanced F-Score ($F_1$), and Matthews Correlation Coefficient (MCC).

AUC [22], [23] was the preferred measure of performance for this challenge as it is robust and ideal for evaluating the quality of ranked results for all recall percentages. Nonetheless, it does not account directly for misclassification errors; for instance, the runs submitted by team 13 (footnote 5) labeled every document as positive, yet had the 6th best AUC in the challenge ($r_{13,AUC} = 6$), after runs from team 20 (footnote 6) and our own team 9. Accuracy is the proportion of true results, which is a standard measure for assessing the performance of binary classification [20], [21]. $F_1$ is also a standard measure of classification effectiveness [20]; it is a balanced measure of the proportion of correct results from the returned results (precision) and from those that should have been returned (recall). Because $F_1$, unlike Accuracy, does not depend on the number of true negatives, it is important to take into account both measures, especially in the unbalanced scenario of this challenge where the abundance of negative (irrelevant) articles leads to high values of the Accuracy measure for classifiers biased for negative classifications [21]. The MCC measure (footnote 7) [24] is a well-regarded, balanced measure for binary classification very well suited for unbalanced class scenarios such as this challenge [21]. These four measures assess distinct aspects of binary classification, thus yielding a well-rounded view of performance when combined via the rank product of formula 4.

5. Hongfang Liu's team at Georgetown University.
6. Kyle Ambert and Aaron Cohen at Oregon Health & Science University.

TABLE 5. Central tendency and variation of performance measures for all submissions to the ACT of the BC II.5 Challenge. Columns: Accuracy, MCC, P at Full R, AUC, F1. Rows: Mean, St. Dev., Median, ?% Conf., ?% Conf. The numeric values are missing from this copy.

There is no need to include other performance measures such as sensitivity and specificity in the set of measures in our performance rank product: sensitivity is the same as recall (footnote 8), already taken into account by the F-Score, and specificity (or True Negative Rate) is of little utility when classes are unbalanced with many more negative (irrelevant) documents, as in this challenge. Moreover, including these two measures in our rank product does not change the rank of the top two performing runs for the entire challenge (for original or re-submitted runs).

All five of our submitted runs were well above the central tendency of the runs submitted by all teams (in the collection of online and offline submissions). Indeed, the performance of all of our submitted runs is above the 95% confidence interval of the mean of all submitted runs.
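The rank product of formula 4 and three of its constituent measures can be sketched as follows. This is a hedged illustration: the confusion matrices are invented, the classifier names are hypothetical, AUC is omitted because it needs ranked outputs rather than a confusion matrix, tied scores share the best rank, and MCC is set to 0 when its denominator vanishes (both conventions are assumptions, not taken from the challenge evaluation script).

```python
import math


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def f1(tp, fp, tn, fn):
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def ranks(scores):
    """Rank 1 = highest score; tied scores share the best rank."""
    ordered = sorted(scores.values(), reverse=True)
    return {name: ordered.index(s) + 1 for name, s in scores.items()}


def rank_product(per_measure_scores):
    """Geometric mean of each classifier's ranks over all measures (formula 4)."""
    measure_ranks = [ranks(s) for s in per_measure_scores.values()]
    k = len(measure_ranks)
    return {
        c: math.prod(r[c] for r in measure_ranks) ** (1 / k)
        for c in measure_ranks[0]
    }


# Invented confusion matrices (tp, fp, tn, fn) for three hypothetical classifiers.
confusion = {
    "A": (40, 10, 500, 20),
    "B": (60, 100, 410, 0),
    "all_positive": (60, 510, 0, 0),
}
measures = {"Accuracy": accuracy, "F1": f1, "MCC": mcc}
scores = {
    name: {c: fn_(*cm) for c, cm in confusion.items()}
    for name, fn_ in measures.items()
}
rp = rank_product(scores)
best = min(rp, key=rp.get)
```

In this toy setting the classifier that labels everything positive has perfect recall but ranks last on all three measures, so its rank product is the largest, mirroring the point made above about measures that ignore misclassification.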
Table 5 depicts the central tendency and variation of the performance measures for the runs submitted to the challenge by all participating groups. Table 6 shows the overall top five original runs submitted to the ACT of the BC II.5 Challenge, ranked in increasing value of the rank product of formula 4; Table 7 shows the overall top five runs after correction of the Unicode and ABNER cache errors. According to the rank product of the four measures discussed above, our corrected, post-challenge run 1' is the top classifier, followed by the best run from team 20 and our other 4 runs (5', 4', 2', 3', respectively). If we do not consider our re-submitted runs, then the best run from team 20 is the top performer, followed by our submitted official runs 5, 4, and 1, followed by three runs from Team ? (footnote 9). Therefore, even without considering our re-submitted runs, the VTT classifier was one of the top two performers overall. Looking at the four measures of performance individually, of the original submissions VTT run 5 was the top performer for MCC and $F_1$, while VTT run 4 was the top performer for Accuracy and second-best for AUC after team 20. When we consider the resubmitted runs, VTT run 1' was the top performer for Accuracy, MCC, and $F_1$, while VTT run 4' achieved the best AUC score, which was the preferred performance measure in the challenge.

7. $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
8. $\mathrm{Recall} = \frac{TP}{TP + FN}$

TABLE 6. Rank Product performance of top 5 original submissions to the ACT of the BC II.5 Challenge. Also shown are individual ranks for the four constituent performance measures. Columns: Runs, RP, AUC, F1, Accuracy, MCC. Rows: one run from another team, three Team 9 runs, one run from another team; the run identifiers and numeric values are missing from this copy.

TABLE 7. Rank Product performance of top 5 submissions to the ACT of the BC II.5 Challenge, after Unicode and ABNER cache correction. Also shown are individual ranks for the four constituent performance measures. Columns: Runs, RP, AUC, F1, Accuracy, MCC. Rows: one Team 9 run, one run from another team, three Team 9 runs; the run identifiers and numeric values are missing from this copy.
However, when we consider the other performance measures, this classifier was not our best performer. Using the rank product measure, we conclude that the parameter-set used for Run 1, once properly computed in Run 1', led to the most well-rounded classifier and the top performer for Accuracy, MCC, and $F_1$, while at the same time obtaining quite a good AUC score. The presence of the entity (ABNER counts) feature differentiates Runs 1 and 4. We observe that using this feature led to the most well-rounded submission (Run 1), but not using it led to the best AUC measurement (Run 4). We also observe that the use of additional MIPS data for training purposes did not lead to any improvement in this challenge, as the parameter-sets for Runs 1 and 4 were also the best found for the first data set alone. Moreover, Run 3 (and 3'), which used the best parameter-set for training on MIPS data, was our poorest performer. Finally, we do not observe a distinct benefit of using one or the other type of word-pair features: while the SP feature set was used in our best run (1), the LP feature set was used in our second-best run (5). Figures 6 and 7 depict in graphical form the performance of our submissions for the 4 performance measures above, in comparison with the other top performer (the classifier from Team 20) in the ACT component of this challenge.

9. The team of Yong-gang Cao, of the University of Wisconsin-Milwaukee.

Fig. 6. Accuracy and AUC performance of VTT runs in comparison with the other top performing submission (group 20). The portion of the plane shown is well above the 95% confidence interval of the mean for all submissions to the ACT of the BC II.5 challenge. Blue diamonds represent the official VTT online submissions, and the red squares represent the same runs after fixing Unicode and ABNER cache errors. The green triangle represents the other top performer in this challenge.

Fig. 7. MCC and $F_1$ performance of VTT runs in comparison with the other top performing submission (group 20). The portion of the plane shown is well above the 95% confidence interval of the mean for all submissions to the ACT of the BC II.5 challenge. Blue diamonds represent the official VTT online submissions, and the red squares represent the same runs after fixing the Unicode and ABNER cache errors. The green triangle represents the other top performer in this challenge.

3 CITATION NETWORK CLASSIFIER

3.1 Method

We also developed the Citation Network Classifier (CNC) to identify PPI-relevant articles using features extracted from citations and additional information derived from the citation network of the bibliome. We did not employ this classifier in the online part of the challenge because citation information was only available in the offline, XML-version of the test set. Its lightweight performance, however, makes it suitable for real-time classification.
We implemented this method using a Naive Bayes classifier on the following equally-weighted citation features: (1) cited PubMed IDs (PMIDs), (2) citation authors, and (3) citation author/year pairs. We calculated $p(\mathrm{class} = \mathrm{PPI} \mid \mathrm{Feature} = f)$ and $p(\mathrm{class} = \text{non-PPI} \mid \mathrm{Feature} = f)$ for the features found in the documents in the training set, smoothed the distributions using Laplace's rule (smoothing parameter of 0.01), and selected the top features using their Chi-square rank (top [...] features in runs 1, 2, 4 and 5, and the top [...] in run 3). Additionally, during scoring we treated each document's own authors as if they were cited by that article three times; this allowed authorship information to be included and play a role in improving classifier performance.

During classification, each document was assigned to the class with the Maximum A Posteriori probability (MAP decision rule) given that document's features. An uninformative equiprobable class prior was used. Additionally, $I(\mathrm{Class}; \mathrm{Features})$, the mutual information between a document's class and citation features, was used as a classification confidence score. It was calculated as the decrease in uncertainty (entropy) between the prior and posterior class distributions:

$$I(\mathrm{Class}; \mathrm{Features}) = H(\mathrm{Class}) - H(\mathrm{Class} \mid \mathrm{Features})$$

Because the uncertainty present in the prior class distribution of a binary classifier is at most 1 bit, and because entropy is always non-negative and does not increase under conditioning [25], this quantity naturally falls in the unit interval.

One significant issue encountered during the implementation of this classifier was the lack of an easily accessible database of biological citations, or a comprehensive repository of parsable biological articles from which one could easily be built. We created our own citation database using a combination of scraping and parsing scripts.
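The scoring procedure described above (smoothed Naive Bayes, equiprobable prior, MAP decision, and an entropy-based confidence) can be illustrated with a minimal Python sketch. It is not the implementation used for the submissions: the function names `train`, `posterior` and `classify` are hypothetical, the sketch uses the standard Naive Bayes likelihood form P(f | class) as an assumption, and it omits Chi-square feature selection and the triple-counting of a document's own authors.

```python
import math
from collections import Counter

CLASSES = ("PPI", "non-PPI")

def train(docs, alpha=0.01):
    """docs: list of (citation_features, label) pairs; label is one of CLASSES.
    Returns smoothed per-class likelihoods P(feature | class).
    alpha is the smoothing parameter (0.01 in the text)."""
    counts = {c: Counter() for c in CLASSES}
    for features, label in docs:
        counts[label].update(set(features))
    vocab = set().union(*counts.values())
    model = {}
    for c in CLASSES:
        total = sum(counts[c].values()) + alpha * len(vocab)
        model[c] = {f: (counts[c][f] + alpha) / total for f in vocab}
    return model

def posterior(model, features):
    """Posterior class distribution under an equiprobable class prior.
    Features never seen in training are ignored (an assumption of this sketch)."""
    log_post = {}
    for c in CLASSES:
        log_post[c] = math.log(0.5) + sum(
            math.log(model[c][f]) for f in features if f in model[c])
    top = max(log_post.values())
    unnorm = {c: math.exp(v - top) for c, v in log_post.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def classify(model, features):
    """MAP label plus a confidence in [0, 1]: H(prior) - H(posterior).
    With a uniform binary prior, H(prior) is exactly 1 bit."""
    post = posterior(model, features)
    label = max(post, key=post.get)
    return label, 1.0 - entropy_bits(post)
```

A document whose features are all uninformative keeps the uniform posterior and therefore receives confidence 0, while a document citing features seen only in one class approaches confidence 1.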
Starting from a list of PMIDs from the training data for which citation data was needed, we queried PubMed for publication information and then attempted to locate and download articles in PDF format from journal websites. When a PDF version of an article was retrieved, its raw textual content was first obtained using the pdf2text converter, then the ParsCit parser [26] was used to extract XML-formatted bibliographic references. Successfully parsed reference data was converted into PMIDs using the PubMed search API, which resulted in a list of cited PMIDs for each initial PMID. Our scripts were initially run on articles cited by documents in the BC II.5 training set; further iterations then looked for articles cited by those articles, and so on recursively. Using this method, we acquired approximately [...] PDF files, from which approximately [...] PMIDs, [...] reference PMIDs, and [...] citations were extracted.

The set of cited articles and authors to be found in test data is potentially enormous. Moreover, the training data provides class information ($P(\mathrm{Class} \mid \mathrm{Feature})$ distributions) for only a small number of citation features. Using co-citations allowed this class information to diffuse over the links of the harvested citation network. For this purpose, we used a co-citation measure from feature A to feature B:

$$\omega(A, B) = \frac{\#\ \text{times feature } A \text{ is co-cited with feature } B}{\#\ \text{times feature } A \text{ is co-cited in total}}$$

When a citation feature without class information was found in a test article, its class distribution was approximated as a linear combination of the weights of the edges to its neighbors in a co-citation network defined by the $\omega(A, B)$ measure described above. This network was built using the three types of citation features: PMIDs, authors, and author/year pairs. Feature co-citations that occurred only once were eliminated in all our runs. It should be noted that the co-citation network is a directed weighted graph, since the co-citation measure above is not symmetric. An asymmetry would result if one article or author was usually cited in combination with another, but the latter was also cited in many cases where the former was not. In this situation, the former would have a stronger ω weight to the latter than vice versa.

Finally, we also integrated the CNC with the VTT classifier, configured with the parameter values used in our online submission 4.
This was done in the following manner: if the distance of a document to the decision surface of VTT, as quantified by the δ(d) measure explained in Section 2.2, was above a certain constant, the VTT result was used; otherwise, class membership was decided by the classifier with the largest confidence (VTT or CNC). In that case, the combined confidence was the sum (difference) of the confidence values of the two classifiers when they agreed (disagreed) in their class label assignment, divided by [...].

3.2 Results

The CNC was trained on the combination of the BioCreative II.5 training set (595 documents; see footnote 10) and the BioCreative 2 training set (5495 documents). The 10 most informative features found by CNC are listed in table 8. The PubMed IDs in this table refer to two highly-cited protein-related but not PPI-related articles ([27], [28]), which were found frequently in the negative training data. Among the other authors listed, Elledge SJ, Fields S, and Cooper JA have all published important works in the PPI domain, while the remaining have published extensively in proteomics-related (but again, not PPI-related) literature.

10. While the initial training set released for BC II.5 contained 609 articles, a subsequent version of the training set contained only 595 articles. We used the first set in the training of VTT for the online challenge, but the more recent one in the training of CNC for the offline challenge.

TABLE 8
Highest scoring features found by the CNC algorithm.

Citation      P(F|PPI)   P(F|non-PPI)
PMID:         E          E-04
Elledge SJ    2.80E      E-05
Gygi SP       2.19E      E-04
Fields S      2.99E      E-05
Gorg A        1.83E      E-04
Sanchez JC    9.12E      E-04
PMID:         E          E-04
Creasy DM     4.57E      E-04
Cooper JA     1.99E      E-05
Aebersold R   5.02E      E-04

TABLE 9
CNC parameters for offline runs.

                   Run 1   Run 2   Run 3   Run 4   Run 5
# Features         [...]   [...]   [...]   [...]   [...]
Co-citation data   N       Y       Y       N       Y
Mix with VTT       N       N       N       Y       Y

TABLE 10
Official CNC scores for offline runs. Columns: Run 1, Run 2, Run 3, Run 4, Run 5. Rows: TP, FP, FN, TN, Specificity, Sens./Recall, Precision, F1, Accuracy, MCC, P at Full R, AUC iP/R.
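The co-citation measure ω(A, B) from Section 3.1, and the way class information can diffuse to a feature that lacks it, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the submissions: the names `cocitation_weights` and `diffuse_class` are hypothetical, the denominator of ω is taken to be the total number of co-citation pairs involving A (counted before singletons are dropped), and the "linear combination" is implemented as an ω-weighted average that is renormalized to sum to one.

```python
from collections import Counter
from itertools import permutations

def cocitation_weights(reference_lists, min_count=2):
    """reference_lists: one collection of citation features per citing article.
    Returns a directed weighted graph {A: {B: omega(A, B)}}.
    Pairs co-cited fewer than min_count times are dropped, mirroring the
    elimination of co-citations that occurred only once."""
    pair_counts = Counter()
    for refs in reference_lists:
        for a, b in permutations(set(refs), 2):
            pair_counts[(a, b)] += 1
    totals = Counter()
    for (a, _b), n in pair_counts.items():
        totals[a] += n  # assumed denominator: all co-citations involving A
    omega = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            omega.setdefault(a, {})[b] = n / totals[a]
    return omega

def diffuse_class(feature, omega, class_dist):
    """Approximate P(class | feature) for a feature without class information
    as the omega-weighted average of its neighbours' known distributions.
    Returns None when no neighbour carries class information."""
    mix = {"PPI": 0.0, "non-PPI": 0.0}
    weight = 0.0
    for neighbour, w in omega.get(feature, {}).items():
        if neighbour in class_dist:
            for c in mix:
                mix[c] += w * class_dist[neighbour][c]
            weight += w
    if weight == 0.0:
        return None
    return {c: v / weight for c, v in mix.items()}
```

Because each weight is normalized by the source feature's own co-citation total, ω(A, B) and ω(B, A) generally differ, which is what makes the resulting graph directed.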
We submitted 5 runs to the offline challenge:
1) Naive Bayes classifier using the top [...] citation features.
2) Same as (1) but where citation features are supplemented with the co-citation weight ω.
3) Same as (2) but with top [...] citation features.
4) Same as (1) but in combination with VTT as described above, using a VTT confidence cutoff parameter of [...].
5) Same as (2) but in combination with VTT as described above, using a VTT confidence cutoff parameter of [...].
The parameter sets for these runs are listed in table 9. Table 10 shows the official performance for these five runs submitted to the offline challenge.

The performance of the offline CNC runs was lower than what we obtained for VTT in the online part of the challenge. Nonetheless, for most performance measurements, these runs were still above the mean value for all submissions to the BC II.5 challenge; all of the F1 and most of the MCC measurements were above the median value, and all measurements of Accuracy were above the 95% confidence interval of the mean. Runs 4 and 5, which combine CNC with VTT, led to measurements of AUC, Accuracy, MCC, and F1 above the 95% confidence interval of the mean, though still below the online submissions with VTT alone. Interestingly, these runs also led to the top two measurements of Precision at Full Recall (P at Full R) for the entire challenge, both well above the 99% confidence interval of the mean of all submissions. While the P at Full R measure is not a measure of overall good performance for binary classification, this result shows that integrating CNC with VTT leads to fewer misclassifications if we want to guarantee full recall (retrieval of every relevant document). Figure 8 depicts in a graphical form the performance of all our submissions for the F1 and P at Full R measures.

Fig. 8. F1 and P at Full R performance of offline CNC runs in comparison with the other top performing submission (group 20). Also shown as an orange rectangle is the 95% confidence interval of the mean for all submissions to the ACT of the BC II.5 challenge, for these two performance measures. The black cross denotes the mean value, and the gray star the median. Blue diamonds represent the official VTT online submissions, and the red squares represent the same runs after fixing the Unicode and ABNER cache errors. Blue circles represent the CNC runs; we can see that Runs 4 and 5 are clearly top according to the P at Full R performance measure. The green triangle represents the other top performer in this challenge.

TABLE 11
CNC scores after algorithm corrections (same rows and columns as Table 10).
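For reference, the measures reported in the score tables can be computed from a confusion matrix and a confidence-ranked list as in the sketch below. It uses the standard textbook definitions; the function names are hypothetical, and the official challenge evaluation script may differ in details such as tie handling or the interpolation used for the AUC of the precision/recall curve, which is not covered here.

```python
import math

def binary_measures(tp, fp, fn, tn):
    """Standard binary classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Matthews Correlation Coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "accuracy": accuracy,
            "f1": f1, "mcc": mcc}

def precision_at_full_recall(ranked_relevance):
    """ranked_relevance: booleans ordered by decreasing confidence that the
    document is relevant. Returns the precision at the shallowest cutoff
    that retrieves every relevant document."""
    positives = sum(1 for rel in ranked_relevance if rel)
    if positives == 0:
        return 0.0
    last = max(i for i, rel in enumerate(ranked_relevance) if rel)
    return positives / (last + 1)
```

P at Full R depends only on how deep in the ranking the last relevant document sits, which is why a classifier can lead on this measure without leading on Accuracy or F1.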
Unfortunately, after the challenge we discovered several issues that affected the performance of our CNC submissions in the offline ACT challenge. First, some improperly parsed data needed to be removed from the citation network database. More importantly, the classifier's AUC scores were diminished because the original CNC confidence score was not properly normalized; the mutual-information-based confidence score calculation was only corrected post-challenge. In addition, two parameters were added in order to increase co-citation algorithm speed and decrease the spread of spurious correlations: for features lacking class distributions, one parameter limited potential co-citation neighbors to only a given number of top trained features (as ranked by Chi-square score), while the other parameter limited co-citation links to cases where ω was above a certain threshold. The settings of these parameters (800 top features and an ω threshold of 0.3) were chosen by picking parameter values that maximized F1 scores when tested on the BC II.5 training set after training on the BC2 training set.

Revised scores for the CNC are shown in table 11, where we can see that the performance obtained for the four most important measures improved. Though the performance of P at Full R slightly declined, it still remained well above the performance of all other submissions to the challenge. From the difference between Run 1 and Run 2, as well as Run 4 and Run 5, we also observe that including co-citation data reduced the number of false positives, resulting in an improvement in Accuracy and AUC. However, in terms of the rank product measure of performance (formula 4), this improvement is marginal: RP(CNC Run 5) = 14.8, RP(CNC Run 4) = 14.9, RP(CNC Run 2) = 18.7, RP(CNC Run 1) = 20.7, where these runs ranked 13th, 14th, 18th, and 19th, respectively, out of 37 total runs submitted to the ACT of the BC II.5 challenge.
Interestingly, even with the post-challenge changes, combining CNC with the VTT algorithm using a VTT confidence cutoff parameter of 0.35 improved CNC performance but could not outperform VTT by itself. This was the case even in trials when CNC was mixed with VTT scores at a very low confidence level (not shown).

4 DISCUSSION AND CONCLUSION

From our previous work [5], we knew that the lightweight VTT method performed well in the classification of PPI-relevant abstracts. Given our results in the ACT of the BC II.5 challenge, we can now conclude that it also performs very well in a full-text scenario. Indeed, the VTT classifier, when corrected for the minor errors discussed in Section 2.4, was able to outperform every other submission to this challenge according to the rank product of the four main performance measures (table 7). Even when considering the official VTT submissions (with Unicode and ABNER cache errors), the best VTT run was the second-best submission of the entire challenge according to the same measure (table 6; see Section 2.4 for details).

Fig. 9. VTT decision surface for the best four of five VTT submissions (after correction of Unicode and ABNER cache errors). The horizontal axis corresponds to the value of P(d)/N(d) and the vertical axis corresponds to the value of np(d), for each document. Black pluses represent positive documents, red circles represent negative documents.

Interestingly, VTT uses only a small number of words extracted from the text (1000), minimal entity recognition (protein mentions via the off-the-shelf ABNER [19]), and a linear decision surface. Yet, this method was very competitive against more sophisticated systems in both the BioCreative 2 [5] and BioCreative II.5 challenges. Perhaps the key to the success of this lightweight method in this challenge is the real-world nature of the BioCreative data sets. Because the testing and training data are obtained in realistic annotation and publication scenarios, rather than sampled from prepared corpora with statistically identical feature distributions, more sophisticated machine learning classifiers tend to overfit the training data and do not generalize the concept of protein interaction from the bibliome as well. The drift between training and testing data was a real issue in BC2 [5], and we have evidence that the same may have occurred in the BC II.5 challenge. We trained a classical classifier to distinguish between the training and testing corpora.
Specifically, we used 4-fold cross-validation to train on subsets of articles from the BC II.5 training and testing sets, now labeled according to membership in the training or testing sets rather than PPI-relevance or irrelevance. Classifier features were selected, after Porter stemming and stopword removal, as the top 1000 single words ranked according to their information-gain score [29]. Document vectors, with those same information-gain scores for term weights, were used to train a Support Vector Machine (SVM) classifier (we used the SVM-light package [30] with a linear kernel and default parameters). According to F-Score and AUC measures, the two corpora can be classified and are therefore sufficiently distinct, exhibiting a significant amount of drift. When we used only PPI-relevant articles from the training and testing data, the SVM classifier obtained F1 = 0.63 and AUC = 0.76. When we used only PPI-irrelevant articles, the SVM classifier obtained F1 = 0.54 and AUC = 0.78. When we considered both PPI-relevant and irrelevant articles, the SVM classifier obtained F1 = 0.63 and AUC = 0.79. All scores were averaged over eight 4-fold runs. If the training and testing data were indistinguishable (drawn from the same statistical distribution), AUC and F-Score would be near 0.5. Clearly, this is not the case with this data, nor should it be expected from the real-world scenario of BC II.5. We also see that drift occurs for both PPI-relevant and irrelevant articles.

Figures 4 and 5 show how the positive and negative documents in the training data, using our word-pair features, can be easily separated by a linear surface. If we were to use a more sophisticated decision surface, it would quite possibly obtain much better class separation on the training data. Indeed, we already observed in BC2 that SVM and Singular Value Decomposition classifiers obtained higher performance in the training data than VTT (as measured by accuracy and F-Score), but lower in the testing data [5].
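The drift test described above (label each document by corpus membership, cross-validate a classifier, and check whether AUC departs from 0.5) can be illustrated with a self-contained Python sketch. It deliberately swaps in a much simpler model than the one reported: a smoothed per-word log-odds scorer replaces the SVM, there is no stemming, stopword removal or information-gain selection, and held-out scores are pooled across folds instead of being averaged over repeated runs. The names `auc`, `fit_log_odds` and `drift_auc` are hypothetical.

```python
import math
import random
from collections import Counter

def auc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count one half). Assumes both classes are present."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_log_odds(docs, labels, alpha=1.0):
    """Smoothed log-odds that a word occurs in a 'testing' (True) rather
    than a 'training' (False) document."""
    doc_freq = {True: Counter(), False: Counter()}
    for words, y in zip(docs, labels):
        doc_freq[y].update(set(words))
    n = {y: labels.count(y) for y in (True, False)}
    vocab = set(doc_freq[True]) | set(doc_freq[False])
    return {w: math.log((doc_freq[True][w] + alpha) / (n[True] + 2 * alpha))
               - math.log((doc_freq[False][w] + alpha) / (n[False] + 2 * alpha))
            for w in vocab}

def drift_auc(train_docs, test_docs, folds=4, seed=0):
    """Cross-validated AUC for telling the two corpora apart.
    Values near 0.5 suggest no detectable drift; values near 1.0 suggest
    the corpora are easy to distinguish."""
    docs = list(train_docs) + list(test_docs)
    labels = [False] * len(train_docs) + [True] * len(test_docs)
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    scores = [0.0] * len(docs)
    for k in range(folds):
        held_out = set(order[k::folds])
        fit_idx = [i for i in order if i not in held_out]
        weights = fit_log_odds([docs[i] for i in fit_idx],
                               [labels[i] for i in fit_idx])
        for i in held_out:
            scores[i] = sum(weights.get(w, 0.0) for w in set(docs[i]))
    return auc(scores, labels)
```

Each document is represented as a list of word tokens; calling `drift_auc(training_corpus, testing_corpus)` on two corpora with disjoint characteristic vocabulary yields an AUC close to 1.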
Since VTT had already been compared to traditional classifiers such as SVM [5], in this challenge we did not submit runs with those kinds of classifiers and instead chose to test more parameters of the VTT and the novel CNC. Therefore, to decide if algorithms submitted to the challenge with more sophisticated decision surfaces suffered from the drift between training and testing, we would need access to their performance on the training data, not just the available results on testing data. Given the overall performance of VTT, we can at least say that this method was highly competitive in dealing with the measurable drift between training and testing data. Figure 9 depicts the decision surfaces of the VTT method for four (corrected) submissions on the final test data. While better surfaces clearly exist to classify the test data, the linear surface of the VTT method avoided overfitting, and was very good at generalizing the concept of protein-protein interaction from the bibliome in the not fully statistically ideal, real-world scenario of BC II.5, while remaining lightweight computationally.

We also conclude that training with additional data from MIPS, which contains articles from various publication sources rather than a single journal, was not very advantageous. This seems to argue against the ability of the VTT method to generalize the real-world concept of protein-protein interaction. However, the "real world" in this task is the scenario of FEBS Letters curators attempting to identify PPI-relevant documents among the articles submitted to this journal: all systems were ultimately only tested on the FEBS Letters test set, and not in determining PPI-relevance at large. As for using features extracted using entity recognition, we can say that counting protein mentions via ABNER in abstracts and figure captions was moderately advantageous (though not using it led to a higher AUC score). We also observed during training that using other entities from ABNER and relevant ontologies (see Section 2.1) was not advantageous. Therefore, while using ABNER protein counts did not lead to a large improvement in classification, it was the only entity we were able to identify which led to a moderate improvement in classification using the VTT method.

The performance of the newly introduced CNC algorithm in the ACT task was not competitive with the best content-based classifiers, but was still above average and provides a proof-of-concept demonstration of the applicability of the citation-network method to the biomedical document classification domain. Our implementation points to several approaches that could be investigated in the search for high-performance citation-network-based classification.
First, we did not use counts of how many times each reference was cited in a document, though use of such weighted features could indicate the citations that are most informative about a given article's class label. Additionally, including the title of the citing document section in the citation features could lead to better performance. Different sections may reference articles for different reasons; citations from the Methodology section, for example, may be particularly useful in identifying documents relevant to a specific biomedical subfield, as in the ACT task. Finally, another way to capture citation styles relevant to domain-specific classification would involve combining citation features with statistically significant tokens from citing sentences, which are known as citances and have already received some attention in the biomedical text-mining field [31].

Performance of the CNC depends not only on the algorithm and training data, but also on the underlying citation database from which ω weights are computed. We observed (see Section 3.2) that including co-citation data reduced the number of false positives, but ultimately led to a marginal overall performance improvement. The citation network used in our work, however, is extremely limited in coverage and subject to parsing errors. An accessible, high-quality repository of biomedical citation data would go a long way towards advancing citation-network-based classifiers in the field. Indeed, literature domains where such repositories exist, such as the publicly available US patents database, have seen wider application of co-citation-based algorithms (see, for example, [32], [33]).

In summary, we have shown that our VTT classifier, previously applied to abstracts only, is also very competitive in the classification of PPI-relevant documents in a real-world, full-text scenario such as the one provided by BC II.5.
Moreover, the novel CNC is the first application of a citation-based classifier to the PPI domain and is thus a promising new avenue for further investigation in bibliome informatics.

AUTHORS' CONTRIBUTIONS

Artemy Kolchinsky developed and implemented the CNC method, helped set up the online server, participated in various experimental and validation computations, and helped write the manuscript. Alaa Abi-Haidar helped develop the VTT method, produced the code necessary for pre-processing abstracts and computing training data partitions, participated in various experimental and validation computations, and helped with producing figures for the manuscript. Jasleen Kaur helped set up the online server as well as with data preprocessing. Ahmed Abdeen Hamed conducted feature extraction experiments from various ontologies. Luis M. Rocha was responsible for integrating the team and designing the experimental setup, writing the manuscript, as well as developing the VTT method.

ACKNOWLEDGMENTS

We are very thankful to the editors and reviewers of this article for the very detailed and useful reviews provided. We would like to acknowledge the help of Predrag Radivojac and Nils Schimmelmann, who provided the additional MIPS data used by our team. We would also like to thank the FLAD Computational Biology Collaboratorium at the Gulbenkian Institute in Oeiras, Portugal, for hosting and providing facilities used to conduct part of this research.

REFERENCES

[1] L. Hunter and K. Cohen, "Biomedical language processing: What's beyond PubMed?" Molecular Cell, vol. 21, no. 5.
[2] PubMed. [Online].
[3] H. Shatkay and R. Feldman, "Mining the biomedical literature in the genomic era: An overview," Journal of Computational Biology, vol. 10, no. 6.
[4] L. J. Jensen, J. Saric, and P. Bork, "Literature mining for the biologist: from information retrieval to biological discovery," Nat Rev Genet, vol. 7, no. 2, Feb.

[5] A. Abi-Haidar, J. Kaur, A. Maguitman, P. Radivojac, A. Rechtsteiner, K. Verspoor, Z. Wang, and L. M. Rocha, "Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks," Genome Biology, 9(Suppl 2):S11, 2008. [Online]. Available: http://genomebiology.com/2008/9/s2/s11/abstract/
[6] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia, "Overview of BioCreative: critical assessment of information extraction for biology," BMC Bioinformatics, vol. 6 Suppl 1, p. S1.
[7] Proceedings of the Second BioCreative Challenge Evaluation Workshop, vol. ISBN 84-933255-6-2, 2007.
[8] S. Chakrabarti, Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann.
[9] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos, "An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal messages," Annual ACM Conference on Research and Development in Information Retrieval.
[10] T. Joachims, Learning to classify text using support vector machines: methods, theory, and algorithms. Kluwer Academic Publishers.
[11] R. Feldman and J. Sanger, The Text Mining Handbook: advanced approaches in analyzing unstructured data. Cambridge: Cambridge University Press.
[12] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys (CSUR), vol. 34, no. 1.
[13] M. Krallinger and A. Valencia, "Evaluating the detection and ranking of protein interaction relevant articles: the BioCreative challenge interaction article sub-task (IAS)," in Proceedings of the Second BioCreative Challenge Evaluation Workshop.
[14] H. W. Mewes, C. Amid, R. Arnold, D. Frishman, U. Guldener, G. Mannhaupt, M. Munsterkotter, P. Pagel, N. Strack, V. Stumpflen, J. Warfsmann, and A. Ruepp, "MIPS: analysis and annotation of proteins from whole genomes," Nucleic Acids Res, vol. 32, no. Database issue, pp. D41-D44, Jan.
[15] F. Fdez-Riverola, E. Iglesias, F. Diaz, J. Mendez, and J. Corchado, "Spamhunting: An instance-based reasoning system for spam labelling and filtering," Decision Support Systems, vol. In Press. [Online]. Available: 4MR7D2T-1/2/1ea88c60a24a977e08f2be9c577f6b0
[16] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5.
[17] M. Porter, "An algorithm for suffix stripping," Program, vol. 13, no. 3.
[18] R. Breitling, P. Armengaud, A. Amtmann, and P. Herzyk, "Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments," FEBS Letters, vol. 573, no. 1-3, August.
[19] B. Settles, "ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text," Bioinformatics, vol. 21, no. 14.
[20] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. New York: ACM Press, Addison-Wesley.
[21] P. Baldi, "Assessing the accuracy of prediction algorithms for classification: an overview," Bioinformatics, vol. 16, no. 5, May.
[22] L. E. Dodd and M. S. Pepe, "Partial AUC estimation and regression," Biometrics, vol. 59, no. 3.
[23] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8.
[24] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, no. 2, October.
[25] T. Cover and J. Thomas, Elements of information theory. John Wiley & Sons, New York.
[26] I. Councill, C. Giles, and M. Kan, "ParsCit: An open-source CRF reference string parsing package," in Proceedings of LREC.
[27] U. Laemmli et al., "Cleavage of structural proteins during the assembly of the head of bacteriophage T4," Nature, vol. 227, no. 5259.
[28] D. Perkins, D. Pappin, D. Creasy, J. Cottrell et al., "Probability-based protein identification by searching sequence databases using mass spectrometry data," Electrophoresis, vol. 20, no. 18.
[29] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," Proceedings of the Fourteenth International Conference on Machine Learning.
[30] T. Joachims, "Making large-scale support vector machine learning practical," Advances in kernel methods: support vector learning.
[31] P. Nakov, A. Schwartz, and M. Hearst, "Citances: Citation sentences for semantic analysis of bioscience text," in Proceedings of the SIGIR04 workshop on Search and Discovery in Bioinformatics.
[32] K. Lai and S. Wu, "Using the patent co-citation approach to establish a new patent classification system," Information Processing and Management, vol. 41, no. 2.
[33] X. Li, H. Chen, Z. Zhang, and J. Li, "Automatic patent classification using citation network information: an experimental study in nanotechnology," in Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries. ACM, 2007.

Artemy Kolchinsky is pursuing a PhD in the complex systems track of the School of Informatics and Computing, Indiana University Bloomington. He is also a visiting graduate student at the FLAD Computational Biology Collaboratorium at the Instituto Gulbenkian de Ciencia, Portugal.

Alaa Abi-Haidar received an MS in Computer Science from Indiana University. He is currently a PhD candidate in the School of Informatics and Computing at Indiana University. His current research interests include text mining, classification, bio-inspired computing and artificial immune systems.

Jasleen Kaur received an MS in Bioinformatics from Indiana University Bloomington in [...]. She is currently pursuing a PhD in Informatics in the complex systems track of the School of Informatics and Computing, Indiana University Bloomington. Her research interests include text mining, literature mining, bioinformatics and social networks mining.

Ahmed Abdeen Hamed has an MS in Computer Science from Indiana University and is a part-time PhD student in Computer Science at the University of Vermont. His research interests are text mining, web mining, and scientific workflows. He is concerned with ecosystems monitoring and developing scientific workflows that can produce alerts for conservationists and decision makers. He currently holds a professional position with the Marine Biological Laboratory, Woods Hole, Massachusetts.

Luis M. Rocha is an Associate Professor at the School of Informatics and Computing at Indiana University, Bloomington, where he has directed the PhD program on Complex Systems and is also a member of the Center for Complex Networks and Systems and core faculty of the Cognitive Science Program. He is also the director of the FLAD Computational Biology Collaboratorium and is in the direction of the associated PhD program in Computational Biology at the Instituto Gulbenkian de Ciencia, Portugal, where the central goal is interdisciplinary research involving the life sciences. His research is on complex systems, computational biology, artificial life, embodied cognition and bio-inspired computing. He received his Ph.D. in Systems Science in 1997 from the State University of New York at Binghamton.
