Learning Ranking vs. Modeling Relevance


Dmitri Roussinov
Department of Information Systems
W. P. Carey School of Business
Arizona State University

Weiguo Fan
Accounting and Information Systems Department
Virginia Tech
wfan@vt.edu

Abstract

The classical (ad hoc) document retrieval problem has traditionally been approached through ranking according to heuristically developed functions (such as tf.idf or BM25) or through generative language modeling, which requires explicit assumptions about term distributions. The now popular discriminative approaches (classification, machine learning, statistical forecasting, etc.) have mostly been abandoned for this task in spite of their success in the different task of text categorization. In this paper, we study whether a classifier can be trained solely on labeled examples to generalize successfully to new (unseen by the system) queries and provide performance comparable with popular heuristic or language models. Our SVM-based classifier learns from the relevance judgments available with the standard test collections and generalizes to new, previously unseen queries its ability to compare and rank documents with respect to a given query. To accomplish this, we have designed a representation scheme based on the discretized form of high-level statistics of the query term occurrences (such as tf, df, and document length) rather than on individual terms. Using the standard metric of average precision and standard large and small test collections, we confirmed that our machine learning approach can achieve performance comparable with, and better than, that of the current state-of-the-art models.

1. Introduction And Prior Work

In spite of several decades of research, modern document retrieval technology still has to overcome the burden of information overload. Although many improvements have been successfully tried to improve document ranking with respect to a user's request, they all require manual parameter tuning in order to be helpful rather than detrimental within a particular application. The variety of techniques only disorients practitioners and designers of document management systems, suggesting no clear winner and no methodology for choosing the applicable improvement techniques and their parameters. On the other hand, many practitioners are well familiar with the machine learning (classification) paradigm, where training sets are (typically manually) developed and the appropriate technological solutions are selected and their parameters are tuned on those sets. Although used with striking success for text categorization, classification-based approaches (e.g. those based on support vector machines [9]) have been relatively abandoned when trying to improve ad hoc retrieval in favor of empirical (e.g. vector space [15]) or generative (e.g. language [19]) models, which produce a ranking function that gives each document a score, rather than trying to learn a classifier that would help to discriminate between relevant and irrelevant documents and order them accordingly. A generative model needs to assume that the query and document words are sampled from the same underlying distributions and that the distributions have certain forms, which entail specific smoothing techniques (e.g. the popular Dirichlet prior).
A discriminative (classifier-based) model, on the other hand, does not need to make any assumptions about the forms of the underlying distributions or the criteria for relevance; instead, it learns to predict which class a certain pattern (document) belongs to based on labeled training examples. Another important advantage of a discriminative approach for the information retrieval task is its ability to explicitly utilize the relevance judgments that come with standard test collections in order to train the IR algorithms and possibly enhance retrieval accuracy for new (unseen) queries. Our work is motivated by the objective of bringing the numerous achievements in the domains of machine learning and classification closer to the classical task of ad hoc information retrieval (IR), which is ordering documents by the estimated degree of relevance to a given query. Our classifier learns how to compare every pair of documents with respect to a given query, based on the relevance-indicating features that the documents may have. As is commonly done in information retrieval, the features are derived from the word overlap between the query and documents.

The earliest formulation of the classic IR problem as a classification (discrimination) problem was suggested by Robertson and Sparck Jones [13]; however, it performed well only when relevance judgments were available for the same query and did not generalize well to new queries. Fuhr and Buckley [5] used polynomial regression to estimate the coefficients in a linear ranking function combining such well-known features as weighted term frequency, document length and query length. They tested their description-oriented approach on the standard small-scale collections (Cranfield, NPL, INSPEC, CISI, CACM), achieving relative changes in average precision ranging from -17% to +33% depending on the collection tested and the implementation parameters. Gey [6] applied logistic regression in a similar setting with the following results: Cranfield +12%, CACM +7.9%, CISI -4.4%.

However, he did not test these models on new (unseen by the algorithm) queries, hypothesizing that splitting documents into training and testing collections would not be possible since a large number of queries is necessary to train a decent logistic regression approach to document retrieval. Instead, he applied a regression trained on Cranfield to the CISI collection, but with a negative effect.

Recently, learning-based approaches have reported several important breakthroughs. Fan et al. [4] applied genetic programming to learn how to combine various terms into an optimal ranking function that outperformed the popular Okapi formula on the robust retrieval test collection. Nallapati [12] made a strong argument in favor of discriminative models and trained an SVM-based classifier to combine 6 different components (terms) from the popular ranking functions (such as tf.idf and language models), achieving better than language model performance in 2 out of 16 test cases (figure in [12]), statistically indistinguishable performance in 8 cases and only 80% of the best performance in 6 cases. Greiff [7] derived the optimal shape of global weighting on a set of TREC collections, resulting in 8-86% improvement over the INQUERY ranking formula. There have been studies using past relevance judgments to optimize retrieval. For example, Joachims [10] applied Support Vector Machines to learn a linear ranking function from user click-throughs while interfacing with a search engine. We would like to emphasize that the task considered here is fundamentally different from routing, filtering, text categorization or any framework based on user relevance feedback: we are optimizing retrieval for new, previously unseen queries for which no relevance judgments are assumed to be available.

In this study, we present a representation scheme based on the discretization of the global (corpus statistics) and local (document statistics) weighting of term overlaps between queries and documents. The major difference of our work from Fan et al. [4], Nallapati [12], or works on fusion (e.g. [18]) is that we did not try to combine several known ranking functions (or their separate terms) into one; rather, we learn the weighting functions directly through discretization. Shorter versions of this paper with a slightly different focus were presented earlier. Discretization allows representing a continuous function by a set of values at certain points. These values are learned by a machine learning technique to optimize certain criteria, e.g. average precision. Thus, we believe our approach offers a significant advantage since it does not limit the shapes of the learned ranking functions to a certain class of functions suggested after heuristic explorations of language modeling done in prior research. We have also empirically established that our combination of representation scheme, learning mechanism and sampling allows learning from past relevance judgments in order to successfully generalize to new (unseen) queries. When the representation was created without any knowledge of the top ranking functions and their parameters, our approach reached the known top performance solely through the learning process.
When our representation took advantage of functions that are known to perform well and of their parameters, the resulting combination was able to slightly exceed the top performance on large test collections and to considerably exceed it on small-scale standard test collections. The next section formalizes our approach, followed by empirical results and conclusions.

2. Formalization Of Our Approach

Our approach to ad hoc document retrieval learns how important each type of occurrence of a query term in a document is. For example, in a very primitive way (for illustration only), we can define two document features: feature S ("strong"), indicating multiple occurrences of a rare query term (e.g. "discretization") in a document, and feature W ("weak"), indicating a single occurrence of a frequent term (e.g. "information"). The particular terms ("discretization" and "information") are not used directly in the representation, so all multiple occurrences of rare terms and single occurrences of frequent terms are treated the same way. A machine learning technique should then discover that feature S is a much stronger indicator of relevance than feature W. In the implementation presented in this paper, each occurrence of a query term t in a document d is assigned to a bin (specified by an integer number within a limited range) based on the term's document frequency in the collection (df) and the number of the term's occurrences within the document (tf). By learning the discrimination properties of each feature (bin), rather than of separate terms, our method allows generalization to new queries. Thus, the ranking functions studied in this paper are limited to the so-called lw.gw class:

R(q, d) = Σ_{t ∈ q} L(tf(t, d), d) · G(t)

Here L(tf, d), the local weighting, is a function of the number of occurrences of the term in the document, tf(t, d), possibly combined with other statistics of the document d, e.g. its length in words. G(t), the global weighting, can be any collection-level statistic of the term (e.g. df, the document frequency). It can be easily verified that this class of ranking functions is very general and includes all the well-known successful ranking functions such as variations of tf.idf and BM25 (Okapi).
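To make the lw.gw class concrete, the sketch below scores a document by summing a local weight times a global weight over the query terms, instantiated here with the classical tf.idf weighting. It is a minimal illustration of the class, not the paper's implementation; the function names and the toy collection statistics are ours (the df values follow Table 1 below, the collection size is made up).

```python
import math

def lw_gw_score(query_terms, doc_tf, doc_len, df, n_docs):
    """Score a document with a ranking function of the lw.gw class:
    R(q, d) = sum over query terms t of L(tf(t, d), d) * G(t).
    Here L and G are instantiated as classical tf.idf weighting."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or df.get(t, 0) == 0:
            continue                              # term absent from document or collection
        local = tf / doc_len                      # L(tf, d): length-normalized term frequency
        global_w = math.log(n_docs / df[t])       # G(t): inverse document frequency
        score += local * global_w
    return score

# Toy usage with illustrative statistics (df values follow Table 1 below).
print(lw_gw_score(["star", "wars"],
                  doc_tf={"star": 6, "wars": 6}, doc_len=500,
                  df={"star": 4536, "wars": 657}, n_docs=750_000))
```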

For example, in the classical tf.idf formula L(tf, d) = tf / |d|, where tf is the number of occurrences of the term t in the document and |d| is the length of the document vector, and G(t) = idf(df(t)) = log(N / df(t)), where df(t) is the total number of documents in the collection that contain term t and N is the total number of documents. The lw.gw representation of BM25 is discussed below in detail. It can also be shown that many of the recently introduced language models fall into this category as well; specifically, the approaches best performing in TREC ad hoc tests (Dirichlet smoothing, Jelinek-Mercer smoothing, and Absolute Discounting) can be represented that way (see equation 6 and table I in [19]).

It has been known for a long time that the shapes of the global and local weighting functions can dramatically affect retrieval accuracy on standard test collections. However, we are not aware of any attempts to learn those shapes directly from labeled examples, which is what we did in this study. Thus, our central research question was the following: can the optimal (best performing) shapes of the global and local functions be learned purely from labeled examples, without heuristic experimentation or elaborate analytical modeling and assumptions about term distributions?

Each occurrence of a query word in a document is assigned to a bin. Each bin is specified by two numbers: g (for global) in the range [1, B] and l (for local) in the range [1, L], as follows:

g(t) = { B * (1 - log(df(t)) / log(N)) }   (1)

l(tf(t, d), d) = min(tf(t, d), L)   (1a)

where N is the total number of documents and {.} stands for rounding down to the nearest integer. The logarithmic scale allows a more even distribution of terms among bins than a simple linear assignment, which is desirable for more efficient learning. It is motivated by a typical histogram of the df(t) distribution, which looks much more uniform on a logarithmic scale. It is important to note that this has nothing to do with the log function in the classical idf weighting. Formula (1) does not produce any weights but only assigns each term occurrence to a specific bin based on the term's document frequency. The weights are trained later and effectively define any shape of global weighting, including those tried in prior heuristic explorations: log, square root, reciprocals and other functions. Let us note that in our case the l(tf, d) formula does not really need rounding to an integer since tf is already a positive integer. However, in a more general case, tf can be normalized by document length (as is done in BM25 and language models) and, thus, local weighting would become a continuous function. It is important to note that our discrete representation does not ignore occurrences above L but simply treats them the same way as tf = L. The intuition behind this capping is that increasing tf above a certain value would not typically indicate higher relevance of the document.

Each occurrence of a query term in a document corresponds to a bin (g, l). Each (g, l) combination determines a feature in a vector representing a document-query pair f(d, q) and is denoted below as f(d, q)[g, l]. The dimensionality of the feature space is L x B; e.g. for 8 local weighting bins and 10 global weighting bins we would deal with a vector of size 80. A feature vector f(d, q) represents each document d with respect to query q.
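As a concrete illustration of formulas (1) and (1a), the following sketch maps a term's collection and document statistics to its (g, l) bin. This is our own minimal reading of the formulas, with illustrative parameter values (the collection size and B, L are made up; the df values are taken from Table 1), not code from the paper.

```python
import math

def global_bin(df_t, n_docs, B):
    """Formula (1): assign a term to a global bin on a logarithmic scale
    of its document frequency; {.} is implemented as floor()."""
    return math.floor(B * (1.0 - math.log(df_t) / math.log(n_docs)))

def local_bin(tf, L):
    """Formula (1a): cap the within-document term frequency at L."""
    return min(tf, L)

# Illustrative usage; df values are from Table 1, the rest are made-up parameters.
for term, df_t, tf in [("anti", 5126, 10), ("star", 4536, 6), ("wars", 657, 6)]:
    print(term, global_bin(df_t, n_docs=750_000, B=8), local_bin(tf, L=8))
```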
Since query term occurrences assigned to the same bin are treated the same way, the value of each feature in the vector is just the number of term occurrences assigned to that bin (g, l):

f(d, q)[g, l] = Σ_{t ∈ q: g(t) = g, l(tf(t, d), d) = l} 1   (2)

Now, for the document ranking function, we can simply use the dot product between the feature vector and the vector of learned optimal weights w:

R(q, d) = w * f(d, q)

Ideally, the learning mechanism should assign higher weights to the more important bins (e.g. multiple occurrences of a rare term) and low weights to the less important bins (e.g. a single occurrence of a common term). The exact learned values determine the optimal shape of the global and local weighting. Table 1 shows an example of the bin assignments and the resulting feature vector for a specific document and the query "the anti missile defense system of star wars". The bins with lower numbers (0-7) correspond to the terms with large document frequencies ("the", "of"). Since they happen to occur in almost all documents, their weights will be learned to be very small (non-discriminative). They could alternatively be removed by a stop word list. The other, less frequent words occupy larger bins (g > 0), with l mostly equal to 7 (corresponding to tf = 8) since tf is capped at 8, except "star" and "wars", which have l = 5 (corresponding to tf = 6). The feature vector representing the query/document pair has only the following non-zero coordinates (bins): 7, 15, 22, 23 and 30, with the occurrences within the same bins (e.g. 7, 15 and 23) aggregated.
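The sketch below builds the sparse feature vector of formula (2), keyed directly by (g, l) pairs to sidestep the exact bin-numbering layout, and scores a document as the dot product with learned bin weights. It reuses global_bin and local_bin from the previous sketch and, again, is only our illustrative reading, not the authors' code.

```python
from collections import Counter

def feature_vector(query_terms, doc_tf, df, n_docs, B=8, L=8):
    """Formula (2): for each (g, l) bin, count how many query terms of the
    document fall into it (each matching term contributes 1).
    Returns a sparse {(g, l): count} mapping."""
    f = Counter()
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue                               # term not in document or collection
        f[(global_bin(df[t], n_docs, B), local_bin(tf, L))] += 1
    return dict(f)

def score(feature_vec, weights):
    """R(q, d) = w * f(d, q): dot product of the sparse feature vector with
    the learned bin weights (missing bins default to weight 0)."""
    return sum(count * weights.get(bin_id, 0.0)
               for bin_id, count in feature_vec.items())
```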

Table 1. An example of term frequencies, the resulting bin assignments and the document/query feature vector.

Qid = 0, docid = , doclen = , avgdoclen =
term: the      TF: 42  DF:       g: 0  l: 7  bin: 7
term: of       TF: 11  DF:       g: 0  l: 7  bin: 7
term: system   TF: 20  DF:       g: 1  l: 7  bin: 15
term: defense  TF: 8   DF:       g: 1  l: 7  bin: 15
term: anti     TF: 10  DF: 5126  g: 2  l: 7  bin: 23
term: star     TF: 6   DF: 4536  g: 2  l: 5  bin: 22
term: missile  TF: 15  DF: 1237  g: 2  l: 7  bin: 23
term: wars     TF: 6   DF: 657   g: 3  l: 5  bin: 30
document/query feature vector: 7:2, 15:2, 22:1, 23:2, 30:1

We can still make the representation more powerful by considering the learned weights w[g, l] not as replacements but rather as adjustments to some other, heuristically chosen, global G(t) and local L(t, d) weighting functions (e.g. BM25):

f(d, q)[g, l] = Σ_{t ∈ q: g(t) = g, l(tf(t, d), d) = l} L(t, d) · G(t)   (2a)

We define the specific choice of the global G() and local L() weighting functions as the starting ranking function (SRF). When all the bin weights w[g, l] are set to 1, our ranking function is the same as its SRF. The learning process finds the optimal values of w[g, l] for the collection of training queries and their relevance judgments, thus adjusting the important shapes of the global and local weighting to achieve better accuracy. The SRF can be chosen as one of the ranking functions known to perform well (e.g. tf.idf, BM25, or one based on language models) to take advantage of the fact that those formulas and their optimal parameters on the standard test collections are known to researchers. Alternatively, we can set the SRF to a constant value (e.g. 1, as in formula 2), thus not taking advantage of any of the prior empirical investigations, to see if our framework is able to learn reasonable (or even top-notch) performance purely from labeled examples. Below, we describe our experiments with each approach.

Since the score is linear with respect to the feature values, we can train the weights w as a linear classifier that predicts the preference relation between pairs of documents with respect to a given query. Document d1 is more likely to be relevant (has a higher score) than document d2 iff f(d1, q) * w > f(d2, q) * w, and vice versa. An important advantage of using a linear classifier is that rank ordering of documents according to the learned pairwise preferences can be performed simply by ordering them according to their linear score f(d, q) * w. Please refer to [2] for the ordering algorithms in the more general non-linear case. We chose support vector machines (SVM) for training the classifier weights w[g, l] since they are known to work well with large numbers of features, which ranged in our experiments from 8 to 512, depending on the number of bins. For our empirical tests, we used the SVMlight package, freely available for academic research from Joachims [9]. We preserved the default parameters coming with the version we used.

3. Comparison Tests

3.1 Experiments with large collections

We used the TREC Disks 1 and 2 collections to test our framework, with one set of topics for training and another for testing and vice versa. For indexing, we used the Lemur package [11], with the default set of parameters and no stop word removal or stemming. Although those procedures are generally beneficial for accuracy, it is also known that they do not significantly interfere with testing various ranking functions and thus are omitted in many studies to allow easier replication.
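The paper trains a linear SVM on pairwise preferences derived from relevance judgments. One common way to cast this as a binary classification problem (the authors' exact sampling scheme is not reproduced here, so treat this as an assumption) is to use differences of feature vectors, which can then be written in the sparse input format read by SVMlight. The sketch below is our own illustration under that assumption; bin_index is a hypothetical mapping from (g, l) bins to 1-based feature ids.

```python
def pairwise_examples(query_doc_features, relevance):
    """For each query, emit +1/-1 labeled difference vectors for
    (relevant, non-relevant) document pairs, so a plain linear SVM learns
    the preference relation f(d1, q) * w > f(d2, q) * w."""
    examples = []
    for qid, doc_feats in query_doc_features.items():
        rel = [d for d in doc_feats if relevance[qid].get(d, 0) > 0]
        non = [d for d in doc_feats if relevance[qid].get(d, 0) == 0]
        for dr in rel:
            for dn in non:
                keys = set(doc_feats[dr]) | set(doc_feats[dn])
                diff = {k: doc_feats[dr].get(k, 0) - doc_feats[dn].get(k, 0)
                        for k in keys}
                examples.append((+1, diff))
                examples.append((-1, {k: -v for k, v in diff.items()}))
    return examples

def to_svmlight(examples, bin_index):
    """Serialize examples in SVMlight's sparse text format:
    '<label> <feature_id>:<value> ...', with ascending 1-based feature ids."""
    lines = []
    for label, diff in examples:
        feats = sorted((bin_index[k], v) for k, v in diff.items() if v != 0)
        lines.append(f"{label:+d} " + " ".join(f"{i}:{v}" for i, v in feats))
    return "\n".join(lines)
```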
We used only topic titles for queries to simulate the short queries typically run by online surfers or company employees trying to locate a document. We used the most popular metric, average (non-interpolated) precision, as our performance measure. The characteristics of the collection after indexing are shown in Table 2. We also reproduced results similar to those reported below on the Disk 3 collection and its topics, but did not include them in this paper due to size limitations.

Table 2. The characteristics of the test collection: TREC Disks 1 and 2. Columns: Collection, Number of documents, Number of terms, Number of unique terms, Average document length, Topics.
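For reference, non-interpolated average precision for a single query can be computed as in the following minimal sketch; it is our own illustration of the standard definition, not code from the paper.

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one query: the sum of the
    precision values at the ranks where relevant documents are retrieved,
    divided by the total number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Example: three of five known-relevant documents retrieved at ranks 1, 3 and 6.
print(average_precision(["d1", "d9", "d2", "d7", "d8", "d3"],
                        relevant={"d1", "d2", "d3", "d4", "d5"}))
```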

Table 3. Learning without any knowledge of ranking functions (16 x 8 bin design). For each training/testing topic split, the table reports the Original, Learned and Baseline average precision.

The choice of the baseline is very important for the validity of the findings. We used the results reported in [12] as guidance. According to [12], the best performing language model on this test collection was the one based on Dirichlet smoothing, which we informally verified by varying the parameters available in Lemur. We found the optimal parameter μ = 1900 to be the same as the one reported in [12], but the average precision was lower (0.205 vs. the value reported there). The difference may be attributed to different indexing parameters: we did not use stemming or a stopword list. By experimenting with the other ranking functions and their parameters, we noticed that the implementation of BM25 available in Lemur provided almost identical performance (0.204). Its ranking function is

BM25(tf, df) = tf / (tf + K * (1 - b + b * |d| / d_a)) * log(N / (df + 0.5)),

where |d| is the document length in words and d_a is its average across all documents. The optimal parameter values were close to the defaults K = 1.0 and b = 0.5. We noticed that the query term frequency components could be ignored without any noticeable loss of precision. This may be because the TREC topic titles are short and words are very rarely repeated in the queries. Since the difference between this ranking function and the best of the available language models was negligible, we selected the former as both our baseline and the starting ranking function (SRF) in our experiments. For simplicity, we call it simply BM25 throughout the paper.

First, we were curious to see if our framework could learn reasonable performance without taking advantage of our knowledge of the top ranking functions and their parameters. For this, we set our starting ranking function (SRF) to a constant value (1.0), thus using only the minimum of the empirical knowledge and theoretical models developed by information retrieval researchers over several decades: specifically, only the fact that relevance can be predicted from tf and df. Table 3 shows performance for the 16 x 8 combination of bins. It can be seen that our approach reached a large fraction of the top performance solely through the learning process. The original performance is the one obtained by setting all the classifier weights to 1. When the same set was used for training and testing, the result obviously overestimates the learning capability of the framework. However, it also gives the upper bound of performance of a discretized lw.gw combination. Since we already informally demonstrated that our discrete representation is almost identical in performance to the smooth one, we have an estimate of the upper bound of performance for the entire family of lw.gw ranking functions, which includes all the popular ones such as tf.idf, BM25 and some of the language models. Attaining this upper bound in practice may require a much larger number of training examples or further improvement in the weighting functions due to analytical modeling. In order to evaluate whether more training data can help, we also ran tests using 90 topics for training and the remaining 10 for testing. We ran 10 tests, each time using 10 different sequential topics for testing, and averaged our results. In this case, the averaged performance was completely restored to the baseline level, with a mean difference in precision across test queries of +0.5% and a 1% standard deviation of the mean.
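A minimal sketch of the baseline ranking function as written above, with the query term frequency component dropped and the near-default K = 1.0, b = 0.5; this is our own rendering for illustration, not the Lemur implementation.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               K=1.0, b=0.5):
    """BM25(tf, df) = tf / (tf + K * (1 - b + b * |d| / d_a)) * log(N / (df + 0.5)),
    summed over the query terms; query term frequency is ignored."""
    norm = K * (1.0 - b + b * doc_len / avg_doc_len)
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        score += tf / (tf + norm) * math.log(n_docs / (df.get(t, 0) + 0.5))
    return score
```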
We believe this is a remarkable result considering the difficulties that prior learning-based approaches had with the classical information retrieval task! We attribute our success to both the higher flexibility and the generalizability of our discrete representation. We also varied the number of bins to evaluate the effect of the granularity of the representation. Figures 1 and 2 demonstrate that 8 bins suffice for both global and local weighting; higher numbers did not result in noticeable improvements.

Table 4. Surpassing the baseline performance (8 x 8 bin design). For each training/testing topic split, the table reports the Learned and Baseline average precision and the % change (with per-query standard deviations between +/- 0.9 and +/- 1.3).

Figure 1. Learning local weighting for various numbers of bins (average precision vs. number of bins).

Figure 2. Learning global weighting for various numbers of bins (average precision vs. number of bins, baseline and learned curves).

In order to test whether our approach can exceed the baseline performance, we set BM25 to be our starting ranking function (SRF). Thus, in this case

G(t) = log(N / (df + 0.5))   (6)
L(tf, d) = tf / (tf + K * (1 - b + b * |d| / d_a))

Table 4 shows performance for the 8 by 8 bin design. Although the improvement is relatively small (2-3%), it is still statistically significant at the level of alpha < 0.1 under a paired t-test. The value in the "% change" column shows the mean % improvement across all the queries and its standard deviation. It may differ from the % change of the mean performance since there is wide variability in the performance across queries but smaller variability in the improvement. We believe even such a small improvement is remarkable considering the amount of attention researchers have paid to optimizing ranking functions for this specific data set, which has been available for more than seven years. A number of recent studies reported comparable improvements on the same test collection by using more elaborate modeling or richer representations. Of course, the improvement due to techniques such as those based on n-grams, document structure, natural language processing or query expansion can possibly achieve even better results. However, in this study we deliberately limited our focus to bags of words.

Table 5. Small test collections and their baseline performance. Columns: Collection, Number of Queries, Number of Documents, Baseline Average Precision; rows: Cranfield, NPL, Med, CISI.

3.2 Experiments with small collections

Information retrieval from small and mid-size collections is still important within organizations for their day-to-day activities, such as locating important messages, policy manuals or customer complaint tickets. However, much of the recent research has been performed on TREC-size collections. Small test collections have also been extensively studied in the past, including in the works mentioned in our Introduction. In order to address the possible practical value of searching small collections and to compare our results with past efforts, we also performed our experiments on the following classical collections: Cranfield, NPL, CISI, and Med. Table 5 lists the properties of the test collections and the baseline performance on them.

Table 6 shows the effects of learning across queries. We only explored learning global weighting in those tests. The number of global weighting bins was set to 10. As can be seen, the effect was overall positive, ranging from -4% to +28% relative improvement. The results also show that generally there was no danger of overfitting: effects on the training and testing sets were similar. The only negative effect listed in Table 6 corresponds to a zero learning effect on the training set. Since in practice this can be detected during the training phase, the degradation could easily be avoided by not applying our technique and using a traditional scoring method instead. Table 7 shows the learning effect across different collections, 12 combinations in total, tested once each. Positive relative improvements ranged from 0 to +28%. The only noticeable negative effect was again the result of learning on the CISI collection.
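Combining the pieces above, the following sketch builds the formula (2a) feature vector with BM25's components as the starting ranking function, so that setting every bin weight to 1 recovers the plain BM25 score. It reuses global_bin and local_bin from the earlier sketch and, as before, is only our illustrative reading of the paper's scheme.

```python
import math

def srf_feature_vector(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
                       B=8, L=8, K=1.0, b=0.5):
    """Formula (2a): each query term contributes its starting-ranking-function
    weight L(tf, d) * G(t) (here the BM25 components) to its (g, l) bin,
    instead of a plain count of 1."""
    f = {}
    norm = K * (1.0 - b + b * doc_len / avg_doc_len)
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        bin_id = (global_bin(df[t], n_docs, B), local_bin(tf, L))  # earlier sketch
        srf = tf / (tf + norm) * math.log(n_docs / (df[t] + 0.5))
        f[bin_id] = f.get(bin_id, 0.0) + srf
    return f
```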
CISI does not demonstrate improvement on the training set either, so it can be excluded from the application of our method in practice.

Table 6. Learning across queries: the effects on all 4 small test collections. Columns: Collection, Cross-validation, Precision on the training set, Precision on the testing set, Relative Improvement; rows: Cranfield, NPL, Med, CISI.

It is remarkable that the NPL collection improves as much when trained on Cranfield as when trained on itself. We can observe that NPL and Cranfield are more amenable to the technique. This is not surprising since they have much larger numbers of queries than the other two. Our across-collection training results are very encouraging and stronger than those reported in the prior related studies mentioned in our introduction. We believe the effects can also be further increased by normalization of features within each collection, a standard procedure in machine learning, which we did not try in our study.

4. Other Conclusions, Limitations And Future Work

We explored learning how to rank documents with respect to a given query using linear Support Vector Machines and a discretization-based representation. Our approach to information retrieval represents a family of discriminative approaches currently not well studied by researchers. Our experiments indicate that learning from relevance judgments available with the standard test collections and generalizing to new queries is not only feasible but can be a very powerful source of improvement. When tested on a popular standard collection, our approach achieved the performance of the best well-known techniques (BM25 and language models), which have been developed as a result of extensive past experiments or elaborate theoretical modeling. When combined with the best performing ranking functions, our approach added a small (2-3%), but statistically significant, improvement. Although the practical significance of this study may be limited at the moment, since it does not demonstrate a dramatic increase in retrieval performance on large test collections, we believe our findings make important theoretical contributions since they indicate that the power of the discriminative approach is comparable to the best known analytical or heuristic approaches. This work also lays the foundation for extending the discriminative approach to richer representations, such as those using word n-grams, grammatical relations between words, and the structure of documents. We deliberately limited our investigation to the bag-of-words approach and did not use bigrams, LSI, query expansion, or pseudo relevance feedback. Under those conditions, the small improvements reported here are comparable with those reported in other recent works when evaluations were made relative to a known strong baseline (e.g. Okapi/BM25 or language models) on ad hoc TREC collections. On the classical small test collections, our learning approach demonstrates significant improvement, ranging from 0 to 30%, and even works when trained on a different collection, which prior approaches failed to accomplish. We believe that our approach performs well because it learns the important function shapes of the global and local weighting. The major advantages of our approach are the following:

Simplicity: It does not require any analytical modeling or assumptions about the statistical distributions of query and document terms.
Extensibility: The approach can easily incorporate other learning techniques and other relevance features, such as those based on n-grams, part of speech, structural elements of a document (title, headings) or general properties of a document (popularity, style, trustworthiness, etc.). It can also incorporate other classifiers.

Explicitness: Through analysis of the learned weights, it allows interpreting the importance of specific classes of terms (e.g. frequent vs. rare) and of occurrences of terms in documents (e.g. single vs. multiple occurrence).

Of course, using only a few test cases (topic sets and collections) is a limitation of the current study, which we are going to address in our future research. We view our approach as complementary, rather than competitive, to analytical approaches such as language models. Our approach can also be used as an exploratory tool to identify important relevance-indicating features, which can later be modeled analytically. We believe that our work and the works referred to in this paper may bring many of the achievements made in the more general area of classification and machine learning closer to the task of rank-ordered information retrieval, thus making retrieval engines more helpful in reducing information overload and meeting people's needs.

5. Acknowledgement

Weiguo Fan's work is supported by NSF under grant number ITR. Roussinov's work was supported by the Dean's Award of Excellence, W. P. Carey School of Business, summer.

Table 7. Learning across small collections. Columns: Training Collection, Testing Collection, Average absolute effect on the testing collection (%), Relative effect on the testing collection (%); rows: all 12 training/testing pairs (Cranfield/NPL, Cranfield/Med, Cranfield/CISI, NPL/Cranfield, NPL/Med, NPL/CISI, Med/Cranfield, Med/NPL, Med/CISI, CISI/Cranfield, CISI/NPL, CISI/Med).

References

[1] Bartell, B., Cottrell, G., and Belew, R. (1994). Optimizing Parameters in a Ranked Retrieval System Using Multi-Query Relevance Feedback. Symposium on Document Analysis and Information Retrieval (SDAIR).

[2] Cohen, W., Schapire, R., and Singer, Y. Learning to order things. Journal of Artificial Intelligence Research, 10.

[3] Dougherty, J., Kohavi, R., and Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the Twelfth International Conference on Machine Learning. Tahoe City, CA: Morgan Kaufmann.

[4] Fan, W., Luo, M., Wang, L., Xi, W., and Fox, E. A. (2004). Tuning Before Feedback: Combining Ranking Discovery and Blind Feedback for Robust Retrieval. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR).

[5] Fuhr, N. and Buckley, C. (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9.

[6] Gey, F. C. (1994). Inferring probability of relevance using the method of logistic regression. Proceedings of the 17th ACM Conference on Research and Development in Information Retrieval (SIGIR 94).

[7] Greiff, W. A Theory of Term Weighting Based on Exploratory Data Analysis. ACM SIGIR.

[8] Hearst, M. (1998). Support Vector Machines. IEEE Intelligent Systems Magazine, Trends and Controversies, 13(4), July/August.

[9] Joachims, T. A Statistical Learning Model of Text Classification with Support Vector Machines. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR).

[10] Joachims, T. Optimizing Search Engines Using Clickthrough Data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM.

[11] Kraaij, W., Westerveld, T., and Hiemstra, D. The Lemur Toolkit for Language Modeling and Information Retrieval, 2.cs.cmu.edu/~lemur/

[12] Nallapati, R. (2004). Discriminative models for information retrieval. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR).

[13] Robertson, S. E. and Sparck Jones, K. Relevance weighting of search terms. Journal of the American Society for Information Sciences, 27(3).

[14] Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. Okapi at TREC-4. In D. K. Harman, editor, Proceedings of the Fourth Text Retrieval Conference. NIST Special Publication.

[15] Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. New York: McGraw-Hill.

[16] Song, F. and Croft, W. B. A general language model for information retrieval. Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM 99).

[17] Vapnik, V. N. Statistical Learning Theory. John Wiley and Sons, New York.

[18] Vogt, C. and Cottrell, G. (1999). Fusion Via a Linear Combination of Scores. Information Retrieval, 1(3).

[19] Zhai, C. and Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR).


MTH 141 Calculus 1 Syllabus Spring 2017 Instructor: Section/Meets Office Hrs: Textbook: Calculus: Single Variable, by Hughes-Hallet et al, 6th ed., Wiley. Also needed: access code to WileyPlus (included in new books) Calculator: Not required,

More information

Statewide Framework Document for:

Statewide Framework Document for: Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance

More information

WHEN THERE IS A mismatch between the acoustic

WHEN THERE IS A mismatch between the acoustic 808 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Optimization of Temporal Filters for Constructing Robust Features in Speech Recognition Jeih-Weih Hung, Member,

More information

Semi-Supervised Face Detection

Semi-Supervised Face Detection Semi-Supervised Face Detection Nicu Sebe, Ira Cohen 2, Thomas S. Huang 3, Theo Gevers Faculty of Science, University of Amsterdam, The Netherlands 2 HP Research Labs, USA 3 Beckman Institute, University

More information

Cross-Lingual Text Categorization

Cross-Lingual Text Categorization Cross-Lingual Text Categorization Nuria Bel 1, Cornelis H.A. Koster 2, and Marta Villegas 1 1 Grup d Investigació en Lingüística Computacional Universitat de Barcelona, 028 - Barcelona, Spain. {nuria,tona}@gilc.ub.es

More information

Lecture 1: Basic Concepts of Machine Learning

Lecture 1: Basic Concepts of Machine Learning Lecture 1: Basic Concepts of Machine Learning Cognitive Systems - Machine Learning Ute Schmid (lecture) Johannes Rabold (practice) Based on slides prepared March 2005 by Maximilian Röglinger, updated 2010

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com

More information

10.2. Behavior models

10.2. Behavior models User behavior research 10.2. Behavior models Overview Why do users seek information? How do they seek information? How do they search for information? How do they use libraries? These questions are addressed

More information

Discriminative Learning of Beam-Search Heuristics for Planning

Discriminative Learning of Beam-Search Heuristics for Planning Discriminative Learning of Beam-Search Heuristics for Planning Yuehua Xu School of EECS Oregon State University Corvallis,OR 97331 xuyu@eecs.oregonstate.edu Alan Fern School of EECS Oregon State University

More information

CAAP. Content Analysis Report. Sample College. Institution Code: 9011 Institution Type: 4-Year Subgroup: none Test Date: Spring 2011

CAAP. Content Analysis Report. Sample College. Institution Code: 9011 Institution Type: 4-Year Subgroup: none Test Date: Spring 2011 CAAP Content Analysis Report Institution Code: 911 Institution Type: 4-Year Normative Group: 4-year Colleges Introduction This report provides information intended to help postsecondary institutions better

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

More information

A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval

A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval Yelong Shen Microsoft Research Redmond, WA, USA yeshen@microsoft.com Xiaodong He Jianfeng Gao Li Deng Microsoft Research

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

THE UNIVERSITY OF SYDNEY Semester 2, Information Sheet for MATH2068/2988 Number Theory and Cryptography

THE UNIVERSITY OF SYDNEY Semester 2, Information Sheet for MATH2068/2988 Number Theory and Cryptography THE UNIVERSITY OF SYDNEY Semester 2, 2017 Information Sheet for MATH2068/2988 Number Theory and Cryptography Websites: It is important that you check the following webpages regularly. Intermediate Mathematics

More information

The lab is designed to remind you how to work with scientific data (including dealing with uncertainty) and to review experimental design.

The lab is designed to remind you how to work with scientific data (including dealing with uncertainty) and to review experimental design. Name: Partner(s): Lab #1 The Scientific Method Due 6/25 Objective The lab is designed to remind you how to work with scientific data (including dealing with uncertainty) and to review experimental design.

More information

arxiv: v1 [cs.cl] 2 Apr 2017

arxiv: v1 [cs.cl] 2 Apr 2017 Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,

More information

Human Emotion Recognition From Speech

Human Emotion Recognition From Speech RESEARCH ARTICLE OPEN ACCESS Human Emotion Recognition From Speech Miss. Aparna P. Wanare*, Prof. Shankar N. Dandare *(Department of Electronics & Telecommunication Engineering, Sant Gadge Baba Amravati

More information

University of Groningen. Systemen, planning, netwerken Bosman, Aart

University of Groningen. Systemen, planning, netwerken Bosman, Aart University of Groningen Systemen, planning, netwerken Bosman, Aart IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Cross-Language Information Retrieval

Cross-Language Information Retrieval Cross-Language Information Retrieval ii Synthesis One liner Lectures Chapter in Title Human Language Technologies Editor Graeme Hirst, University of Toronto Synthesis Lectures on Human Language Technologies

More information

Time series prediction

Time series prediction Chapter 13 Time series prediction Amaury Lendasse, Timo Honkela, Federico Pouzols, Antti Sorjamaa, Yoan Miche, Qi Yu, Eric Severin, Mark van Heeswijk, Erkki Oja, Francesco Corona, Elia Liitiäinen, Zhanxing

More information

GCSE Mathematics B (Linear) Mark Scheme for November Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education

GCSE Mathematics B (Linear) Mark Scheme for November Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education GCSE Mathematics B (Linear) Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education Mark Scheme for November 2014 Oxford Cambridge and RSA Examinations OCR (Oxford Cambridge

More information

Cross-lingual Text Fragment Alignment using Divergence from Randomness

Cross-lingual Text Fragment Alignment using Divergence from Randomness Cross-lingual Text Fragment Alignment using Divergence from Randomness Sirvan Yahyaei, Marco Bonzanini, and Thomas Roelleke Queen Mary, University of London Mile End Road, E1 4NS London, UK {sirvan,marcob,thor}@eecs.qmul.ac.uk

More information

Conversational Framework for Web Search and Recommendations

Conversational Framework for Web Search and Recommendations Conversational Framework for Web Search and Recommendations Saurav Sahay and Ashwin Ram ssahay@cc.gatech.edu, ashwin@cc.gatech.edu College of Computing Georgia Institute of Technology Atlanta, GA Abstract.

More information

Firms and Markets Saturdays Summer I 2014

Firms and Markets Saturdays Summer I 2014 PRELIMINARY DRAFT VERSION. SUBJECT TO CHANGE. Firms and Markets Saturdays Summer I 2014 Professor Thomas Pugel Office: Room 11-53 KMC E-mail: tpugel@stern.nyu.edu Tel: 212-998-0918 Fax: 212-995-4212 This

More information

Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany

Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany Jana Kitzmann and Dirk Schiereck, Endowed Chair for Banking and Finance, EUROPEAN BUSINESS SCHOOL, International

More information