Report about: Machine Learning for Static Ranking
Christian Klar


Why is Static Ranking so important nowadays?

For many years the Web has grown exponentially in size, but with this growth the number of low-quality pages has increased as well. The Web is full of incorrect, malicious and spamming websites, so it is more important than ever to be able to grade the quality of a specific website. This is done with Static Ranking, also called Query-Independent Ranking. Basically, there is a collection of websites, and the quality of each website is calculated by looking at some characterizing properties of that website. Through this quality each website gets a rank, and through this rank a ranking is built. Websites that are higher in the ranking should also have a higher quality.

Static Ranking has a lot of direct and indirect benefits. A direct one is of course that we can see how good a page is. An indirect one is that we can build up crawl priorities. Since the Web changes and grows each day, it is impossible for a search engine to crawl the whole Web in a short time. There have to be priorities for which websites should be revisited, and how frequently. Static Ranking obviously helps in creating these priorities: e.g. a website with high quality should be revisited often, because changes there have a higher impact.

The Google PageRank is widely regarded as the best method for Static Ranking. Although it has performed well historically, only little academic evidence exists to prove this point. The purpose of this report is to introduce a different Static Ranking function called RankNet. The introduction of RankNet proceeds in steps, so that the reader is able to understand how a Static Ranking function is created. The emphasis of this report is therefore on machine learning itself. This report is about showing the concepts and bringing together the information of several papers concerning this problem, not about giving mathematical proof for every step made.

1 Problem and Approach
First we have to clearly state the problem of finding a Static Ranking function. Each website w has a feature vector x = (f1, f2, ..., fn) which consists of real numbers. The feature vector describes specific points about the website and characterizes it; each component describes one feature. For example, if two websites have a different value in the first component of the vector, then they differ in that point.

Now we are looking for a function that gives us a rank value for a specific feature vector. That means, if we put the feature vector into the function, we get a rank value as output, and we can compare it with the rank values of other websites.

Figure 1 (overview of the definitions): website w has feature vector x and rank r.

So f is the Static Ranking function: it gives each website a rank r = f(x) according to its feature vector. But we don't know in what way we have to put the features f1, ..., fn together so that the ranking value we get is accurate; e.g. it is possible that one feature is more important than another. One way to derive what f should do with f1, ..., fn is to manually look at websites and give them rank values. Then it would be possible to create a function that does what was done manually before. But then again, it is impossible to look at all websites on the Web. So f has to be able to generalize the information it knows and to apply it to other websites we haven't seen before. Summarized, f has to do the following things:

It has to represent the information we have in an accurate way.
It has to generalize that information in an accurate way.

Our approach to creating f will be machine learning. That means that we simply give the information we have to a learning algorithm; it takes it and forms f. Basically, the learning machine teaches a "blank" f what to do. This is also what is done in RankNet. To understand RankNet, first some basics about machine learning will be introduced and then applied to the problem of finding the Static Ranking function.

2 Learning of a function with gradient descent

To learn a function f we first need some information about that function. For that we have some input and output pairs (x_1, y_1), ..., (x_m, y_m). With these there are several ways to teach a function to behave like this mapping. Regression would be one; another would be to just do some interpolation, which isn't chosen

because that would be too inflexible for this purpose. Our f is categorized as gradient-based learning with back-propagation; why will become clear during this chapter.

Figure 2 (the learning machine): the function f(x, W), the cost function C, and the training pairs (x_k, y_k).

f gets the form f(x, W), which means that it gets the vector x as input. It also gets W as input. W represents a collection of adjustable parameters, or weights, in the system of f. That means that with these parameters it is possible to tune f so that it performs its job in a better way. So, when f gets the input x_k, it computes a value f(x_k, W). If f, determined by a given set of parameters W, were already the function we are looking for, then f(x_k, W) would already be very close to y_k. But since we have to train f first, this usually isn't the case. Therefore we need a measure for the error between f(x_k, W) and y_k. This is done by the cost function C: it takes the (wrong) output f(x_k, W) and the correct y_k and calculates

C_k = C(f(x_k, W), y_k),

which is the deviation between the output of the momentary version of the function and the target output we want to reach with the learning machine. With a particular set of parameters W and the set of pairs (x_1, y_1), ..., (x_m, y_m) we are now able to compute the deviation C_k for each k. The overall cost function, which summarizes all errors, then is

E(W) = sum over k of C_k.

With this function we are able to see the overall performance of the current f: the smaller E(W) is, the better are our parameters. Therefore it is obvious that E will be used to tune the weights in W so that f performs its task better. But how is it possible to adapt the parameters in W so that the value of E decreases? The gradient descent method will be used for that. Since E is a function of the form E(W; (x_1, y_1), ..., (x_m, y_m)), it is clear that the parameters in W are also a direct input into it.
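To make this setup concrete, here is a minimal sketch of such a learning machine: a toy ranking function f(x, W) that is linear in two features, a squared-error cost, the overall cost E summed over the training pairs, and repeated updates against the gradient of E. The features, targets and learning rate are illustrative assumptions, not values from RankNet.

```python
# Toy learning machine: f(x, W) is a linear scorer over two hypothetical
# features, trained by gradient descent on a squared-error cost.

def f(x, W):
    return W[0] * x[0] + W[1] * x[1]

# Hypothetical training pairs (feature vector, target rank value),
# generated here by the rule y = 1.0*x0 + 2.0*x1.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0), ([2.0, 1.0], 4.0)]

def E(W):
    """Overall cost: sum of the per-pair squared errors C_k."""
    return sum((f(x, W) - y) ** 2 for x, y in data)

W = [0.0, 0.0]
eta = 0.05  # learning rate
for _ in range(2000):
    # Partial derivatives of E with respect to each weight.
    g0 = sum(2 * (f(x, W) - y) * x[0] for x, y in data)
    g1 = sum(2 * (f(x, W) - y) * x[1] for x, y in data)
    W = [W[0] - eta * g0, W[1] - eta * g1]
```

After the loop, W should be close to the generating weights (1.0, 2.0) and E(W) close to zero.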

And because the set of pairs (x_1, y_1), ..., (x_m, y_m) is fixed, finding good parameters is basically a matter of playing with them until a constellation is found for which the value of E is small enough. But as playing isn't a very efficient way, the gradient descent method is applied to do that.

Figure 3 (the gradient descent method): an approximation method which at each point chooses the direction of steepest descent to find the minimum of a multidimensional function of the variables w_1, ..., w_p.

So, to quickly recapitulate: there is a function of the form f(x, W) whose output can be changed by tuning the weights in W. The goal is to tune them such that for certain known inputs we get outputs that are close approximations to the target outputs. The overall cost function E shows how far away we are from our goal; the smaller its value, the closer we are.

To show how the gradient descent method is applied to the cost function, f has to get a more detailed look (until the end of this chapter the overall cost is simply denoted E, for convenience). Here a very simple form of a multilayer machine is chosen, namely just a stack of modules, which is presented in figure 4 on the left side. The input is processed through N modules until the output is calculated. Each module n computes a function

(1) x_n = F_n(x_{n-1}, W_n),

which has x_{n-1} and W_n as input: x_{n-1} is the output of module n-1 (x_0 is the input of the whole stack), and W_n is a vector of tunable parameters for F_n. W_n is a subset of W, the set of all weights in this system. The final output x_N shows how an input is processed through the module functions. Each F_n is a vector function. Starting with some initial W(0), we will compute updated weights W(1), W(2), ... until we reach a local minimum of E. To see how this works, figure 5 gives a little example of the computation up to module number 2: there is an input vector of dimension 3; it is the input of the vector function F_1, which consists of two component functions, so its output has dimension 2; then this new vector is given to the next module, and so on. It is important to point out that the dimension can obviously change

in each module, so the input vector and the output vector don't have to be of the same dimension.

Figure 4 (stack of modules, left, and the back-propagation quantities, right). Remarks: dF_n/dx is the Jacobian of F_n with respect to x, evaluated at the point (x_{n-1}, W_n); dF_n/dW is the Jacobian of F_n with respect to W, evaluated at the same point.

Figure 5 (example of the computation through the first two modules).

So, what do we need to be able to use the gradient descent method on E? We need the gradient dE/dW. Before we see how it is calculated, some notation:

(2) dE/dW_n = (dF_n/dW)(x_{n-1}, W_n) * dE/dx_n

(3) dE/dx_{n-1} = (dF_n/dx)(x_{n-1}, W_n) * dE/dx_n

(4) dE/dW = (dE/dW_1, ..., dE/dW_N)

So we need the partial derivatives of E with respect to the parameter vectors W_n of each module. It is possible to compute these sub-vectors of dE/dW by doing so-called back-propagation, which is presented in figure 4 on the right side. If dE/dx_N is known, then dE/dW_N and dE/dx_{N-1} can be computed with (2) and (3). Then again, with dE/dx_{N-1}, the vectors dE/dW_{N-1} and dE/dx_{N-2} can be computed. Going on like that, we get dE/dW_n for all modules n = N, ..., 1. So here are the steps for computing dE/dW for one training pair (x, y):

1. Put x as input into the stack of modules and compute the outputs x_1, ..., x_N.
2. Compute dE/dx_N. That means: first do the partial derivation of E with respect to the variables of the N-th layer (excluding the parameters in W_N), and then put x_N into that formula to get the vector dE/dx_N.
3. Compute the vector dE/dW_N with (2) and the vector dE/dx_{N-1} with (3).
4. Do the same for the modules N-1, N-2, ..., 2, 1.
5. Put the vectors dE/dW_n together to get dE/dW, with the rule described in (4).
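The recursion can be checked numerically on a tiny stack of two scalar modules. The module functions and values below are hypothetical, chosen only so that the Jacobians reduce to single derivatives.

```python
import math

# Tiny stack of two modules to illustrate the back-propagation recursion:
#   module 1: x1 = F1(x0, w1) = w1 * x0
#   module 2: x2 = F2(x1, w2) = tanh(w2 * x1)
# Cost: C = (x2 - y)^2 for one training pair (x0, y).

x0, y = 0.5, 0.2
w1, w2 = 1.5, 0.8

# Forward pass (step 1): propagate the input through the modules.
x1 = w1 * x0
x2 = math.tanh(w2 * x1)

# Backward pass (steps 2-4): start from dC/dx2 and move down the stack,
# multiplying by each module's partial derivatives (the 1-D "Jacobians").
dC_dx2 = 2 * (x2 - y)
dC_dw2 = dC_dx2 * (1 - x2 ** 2) * x1   # dF2/dw2 = (1 - tanh^2) * x1
dC_dx1 = dC_dx2 * (1 - x2 ** 2) * w2   # dF2/dx1 = (1 - tanh^2) * w2
dC_dw1 = dC_dx1 * x0                   # dF1/dw1 = x0

# Numerical check of dC/dw1 by finite differences.
def cost(w1_, w2_):
    return (math.tanh(w2_ * w1_ * x0) - y) ** 2

eps = 1e-6
numeric = (cost(w1 + eps, w2) - cost(w1 - eps, w2)) / (2 * eps)
```

The analytic gradient from the backward pass agrees with the finite-difference estimate, which is exactly what the recursion (2)-(3) guarantees.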

By knowing dE/dW, the gradient descent algorithm for this problem can be built, which is:

(5) W(t+1) = W(t) - eta * dE/dW(W(t)),

where eta is the learning rate. With this algorithm we can iteratively adjust W. Hence we have a learning algorithm for getting the right parameters for f. For RankNet a special case of this learning system will be used, namely artificial neural networks. Therefore the next chapter introduces neural networks, and in the chapter after that we will see how neural networks are applied here.

3 Artificial Neural Networks (ANN)

The idea behind ANNs is very close to the prior topic. An ANN is a computing paradigm that is modeled after the cortical structures of the brain. It consists of interconnected processing elements, called nodes or neurons, that work together to produce an output function. In most cases an ANN is an adaptive system that changes its structure based on information that flows through the network.

Figure 6 (example of an Artificial Neural Network): input x, Layer 1, Layer 2, Layer 3.

To demonstrate how an ANN is built, an example is presented in figure 6. As already stated, an ANN consists of so-called neurons or nodes; in the example these are the nodes of the three layers, ending in the output node. These nodes are connected by arrows, hence the name of this system, neural network. Each node belongs to a specific layer and represents a function. So the ANN is characterized by the number of layers and the number of nodes in each layer. Inside a layer the nodes are independent of each other.
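A single node of such a network can be sketched as a weighted sum of its inputs passed through an activation function, the standard formulation that the next paragraphs make precise. Weights and inputs here are illustrative assumptions.

```python
import math

# One ANN node: weight the inputs, sum them up, then apply a sigmoid
# activation. Weights and inputs are illustrative placeholders.

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def node(inputs, weights):
    a = sum(w * x for w, x in zip(weights, inputs))  # weighted sum
    return sigmoid(a)                                # activation

y = node([1.0, 0.5, -0.5], [0.2, 0.4, 0.6])
# The sigmoid keeps the output in (0, 1) and changes smoothly with the weights.
```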

Each node has an input vector and a single output value: the function in the node takes the input values of the vector and creates the output value by performing some calculation. The input values are the output values of the prior layer; such a system is also called feed-forward. x denotes the vector that is given to the system as input. The last node's function creates the output of the system, which is a single value. So basically the ANN defines a function from x to that output value.

Now that the structure of this ANN is clear, we have to take a look at the functions of the nodes. Each node is defined by the following mathematical function:

(1) y = phi(sum over i of w_i * x_i)

The w_i are weights that denote how much each input x_i should be weighted. So first each input gets a weight, then everything is summed up, and lastly this sum is put into a function phi, which is also called the activation function.

Now it should be obvious why ANNs are so important for our purpose of creating a learning machine for a Static Ranking function: they take an input vector and create a single output value, and by changing the weights in each node it is possible to alter that output. So ANNs have excellent training capabilities and are also very good at generalizing from a set of training data. An informal example could be an ANN which decides whether an animal A belongs to a group G. It takes some characteristics of A as input and then decides. The point is that we don't have to give the ANN an exact characterization of each animal that belongs to group G: through some specific training characterizations of some animals belonging to G, the ANN will learn it by itself.

Figure 7 (the sigmoid function, a scaled tanh).

Back to the activation function. It weights how powerful the output should be, based on the weighted sum of inputs. Usually this is a monotonically increasing sigmoid function. The reason is that only through this nonlinear activation does the neural network get its nonlinear capability for learning. It is easy to understand this concept: if we used a linear function, changes to the weights would cause abrupt changes in the output, while with a sigmoid function everything goes much more smoothly, so that the ANN is able to learn with less noise. The example sigmoid function, a scaled tanh, is presented in figure 7. Using a sigmoid function has other benefits as well, for example keeping the output in a specific range. Finding a good sigmoid function for one's own purpose is an art in itself and the topic of many scientific papers.

Now we have seen two things: the gradient descent method, which allows us to teach a function to behave in a special way, and the notion of an ANN, which simulates a big function with a system of layers and nodes and is nearly perfect at learning and generalizing. In the next chapter these two notions will be brought together to create the system we can use for creating a Static Ranking function, called RankNet.

4 Bringing gradient descent and Artificial Neural Networks together

ANNs are a special case of the stack-of-modules system described in figure 4. The modules become the layers. The function F_n(x_{n-1}, W_n) becomes a matrix multiplication a_n = W_n * x_{n-1} (a_n is called the vector of weighted sums; this a denotes something else than the targets in the training set) followed by a vector function sigma which is applied to a_n; in our case sigma applies a sigmoid function to each component of a_n, so that x_n = sigma(a_n). How our ANN is built up is presented in figure 8; it is pretty self-explanatory. On the right side of the figure there is an example showing how the calculation is done for one module. Now that we have the ANN, we need to adapt the back-propagation to it, to be able to use the gradient descent method. The equations (for the n-th module) we had before were (1), (2) and (3) from chapter 2.

Figure 8,,, Stack of modules Corresponding AAN Example for a layer So, (1) gets the form: (2) gets the form: (11) (3) gets the form: (10) 1 1 1 1 (12) 1 1 1 1,, 1 1 1,, With these back-propagation equations we are able to do gradient descent for the ANN 1 10

So, to summarize the last two chapters, what are we able to do now? If we have a set of pairs (x_1, y_1), ..., (x_m, y_m) and we want to create a function that does this mapping, we can do that by simply feeding this information into our learning machine. This is obviously the machine we need for Static Ranking: there we have such a mapping as well, except that the x are feature vectors and the y are ranking values. RankNet uses a learning algorithm like this to do its work. But before we introduce RankNet, there are several points everyone should be very aware of before creating and training a machine like this:

Back-propagation can be a very slow process. At each iteration the whole training set has to be passed through the system in order to create the gradient. Moreover, there is no formula that guarantees that the network will converge to a good solution, or that convergence occurs at all.

One very important goal of the network is to generalize. But since the measurement of the training sets might be noisy, there might be errors in them. It is easy to imagine that if we collect multiple data sets, each such training set is a little bit different from the others and contains different errors. Each set would thus lead to different parameters for f, because the minimum of its overall cost function is somewhere else. In that way we would also introduce the noise and the errors into our network; this is called overtraining. There are a number of techniques, and a huge amount of papers, concerning these problems; they try to maximize generalization and to speed up back-propagation.

5 RankNet

First we have to define what the training set and the cost function look like. Normally one would take the feature vectors with their target rank values and just train the function to do exactly this mapping. But that is not what we do. Instead we look at the ordering of the websites and train the function with that information. Through this we optimize the ordering of the websites (which is what ranking is actually about), rather than optimizing the rank values. It has to be made clear that we are still looking for a function f which gives us a rank for a feature vector; we are just training it with a different kind of information.

So the training set is a collection of items of the form (x_i, x_j, Pbar_ij), where x_i denotes the feature vector of website w_i, x_j denotes the feature vector of website w_j, and Pbar_ij denotes the probability that w_i is ranked higher than w_j. By convention we only have training items where w_i should be ranked higher than w_j. The notation w_i > w_j says that w_i has to be ranked higher than w_j. Hence the function f we are looking for should ideally meet the following invariant:

(1) w_i > w_j if and only if f(x_i) > f(x_j)

Figure 9 (the functions involved in the cost: the map from rank difference to probability, upper left; the cost against the target probability, right; the combined cost curves, bottom).

Now we have the definition of our training set, but how do we adapt our learning algorithm so that it teaches this information? This is where the new cost function comes into play. Let o_i = f(x_i), o_j = f(x_j) and o_ij = o_i - o_j. Some points about o_ij:

If o_ij > 0, then w_i with its feature vector x_i has a higher rank than w_j with x_j.
The bigger o_ij is, the bigger the difference between the ranks of w_i and w_j.
The bigger Pbar_ij is, the bigger o_ij should be: if the probability that one page is ranked higher than the other is big, then their rank values should show that as well.

So the cost function becomes:

(2) C_ij = -Pbar_ij * log(P_ij) - (1 - Pbar_ij) * log(1 - P_ij), with

(3) P_ij = e^(o_ij) / (1 + e^(o_ij)),

which together give

(4) C_ij = -Pbar_ij * o_ij + log(1 + e^(o_ij)),

where C_ij is the cost or error value for the training item (x_i, x_j, Pbar_ij). What (3) does is immediately obvious by looking at its diagram on the upper-left side of figure 9: it maps the rank difference to a probability; the bigger the rank difference, the bigger the probability. (2) compares the probability we computed in (3) with the target probability we want to reach and gives the deviation as the cost value; it is drawn on the right side of figure 9. The plot at the bottom of figure 9 should help in understanding how (2) and (3) work together: it shows how big the cost is (y-axis) for a specific o_ij (x-axis) and a specific target probability (yellow 0.95, green 0.5 and red 0.05). We can observe that, e.g., for a high ranking difference and a high target probability the cost is very low.

So we have a cost function as well. Now we can set up our ANN. A 2-layer network with a single output is chosen and presented in figure 10. The node labels in Layer 1 only show how many nodes are in that layer; it is always the same kind of node function.

Figure 10 (the RankNet ANN): Layer 1 with several nodes, Layer 2 with a single output node.
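The pairwise cost and the probability map can be written down directly. This is a sketch of the two functions described above, not code from the RankNet paper; the sample values are illustrative.

```python
import math

# RankNet pairwise cost: o_ij = f(x_i) - f(x_j) is mapped to a probability
# P_ij, which is compared with the target probability Pbar_ij by a
# cross-entropy cost.

def pair_probability(o_ij):
    """(3): maps the rank difference o_ij to a probability P_ij."""
    return math.exp(o_ij) / (1.0 + math.exp(o_ij))

def pair_cost(o_ij, p_target):
    """(4): C_ij = -Pbar_ij * o_ij + log(1 + e^o_ij)."""
    return -p_target * o_ij + math.log(1.0 + math.exp(o_ij))

# A large rank difference with a high target probability gives a low cost ...
low = pair_cost(4.0, 0.95)
# ... while the same difference with a low target probability is expensive.
high = pair_cost(4.0, 0.05)
```

The closed form in `pair_cost` is algebraically the same as the cross-entropy form: substituting (3) into (2) and simplifying the logarithms yields exactly -Pbar * o + log(1 + e^o).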

We are now able to write the algorithm that teaches f. It is shown in figure 11. First, one equation that should help in understanding the algorithm: by the chain rule,

(5) dC_ij/dW = (dC_ij/do_ij) * (df(x_i)/dW - df(x_j)/dW), with dC_ij/do_ij = -Pbar_ij + e^(o_ij) / (1 + e^(o_ij)).

Figure 11 (the RankNet algorithm):

1. Initiation: initiate the weight vector W with some values.
   (Steps 2-4 are done for each (x_i, x_j, Pbar_ij) in the training set, which has m such items.)
2. Calculate f(x_i) for x_i; do back-propagation and get df(x_i)/dW.
3. Calculate f(x_j) for x_j; do back-propagation and get df(x_j)/dW.
4. Calculate o_ij = f(x_i) - f(x_j) and, with (5), calculate dC_ij/dW.
5. If steps 2-4 have been done for every item of the training set, calculate the overall gradient dE/dW = (1/m) * sum of the dC_ij/dW and update W = W - eta * dE/dW.
6. Calculate E. If it is small enough, stop; if not, go back to step 2.
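The loop of figure 11 can be sketched end to end. For brevity this sketch uses a linear scorer f(x) = w . x in place of the 2-layer network, an assumption made here so that back-propagation reduces to df/dw = x; the pairs and the learning rate are illustrative as well.

```python
import math

# Sketch of the RankNet training loop with a linear scorer. Each pair
# (x_i, x_j, Pbar_ij) says the first page should outrank the second.

pairs = [
    ([3.0, 1.0], [1.0, 2.0], 1.0),
    ([2.5, 0.5], [1.0, 1.5], 1.0),
    ([4.0, 0.0], [2.0, 1.0], 1.0),
]

def f(x, w):
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.0, 0.0]          # step 1: initiate the weights
eta = 0.1

for _ in range(200):    # steps 2-6, repeated
    grad = [0.0, 0.0]
    for xi, xj, p_bar in pairs:
        o_ij = f(xi, w) - f(xj, w)                        # steps 2-4
        dC_do = -p_bar + math.exp(o_ij) / (1 + math.exp(o_ij))
        # d o_ij / d w = xi - xj for the linear scorer
        grad = [g + dC_do * (a - b) for g, a, b in zip(grad, xi, xj)]
    w = [wi - eta * g / len(pairs) for wi, g in zip(w, grad)]  # step 5

# After training, every "higher" page should score above its partner.
ordered = all(f(xi, w) > f(xj, w) for xi, xj, _ in pairs)
```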

6 Benefits of machine learning for Static Ranking

There are many advantages besides ranking quality that speak for using machine learning methods for Static Ranking. Here are some of the points:

Since the measure consists of many features, it is harder to manipulate results. In Google's PageRank it is easy to get good rankings just by increasing the number of incoming links or by other web-spamming techniques. But as RankNet is able to learn, features that have become unusable because of spammers can be removed from the final ranking. RankNet therefore has a very good reaction time to new spamming techniques.

It is also easy to add new features of websites; the changes made to the algorithm are not big.

Since advances in the machine learning field have accelerated over the last couple of years, we are able to benefit from them.

The effect that a few outlier websites have a huge impact on the whole ranking is also reduced; everything is smoother with machine learning and RankNet.

7 References

[1] Matthew Richardson, Amit Prakash and Eric Brill: Beyond PageRank: Machine Learning for Static Ranking, 2006.
[2] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton and Greg Hullender: Learning to Rank using Gradient Descent, pages 1-4, 2005.
[3] Yann LeCun, Leon Bottou, Genevieve B. Orr and Klaus-Robert Müller: Efficient BackProp. In: Neural Networks: Tricks of the Trade, Springer, pp. 9-50, 1998.
[4] Li-Tal Mashiach: Learning to Rank: A Machine Learning Approach to Static Ranking, 2006.
[5] Genevieve Orr: Neural Networks, http://www.willamette.edu/~gorr/classes/cs449/intro.html
[6] Hsinchun Chen: Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning and Genetic Algorithms, http://ai.arizona.edu/papers/mlir93/mlir93.html
[7] Sourceforge.net: Neural Network Theory, http://fann.sourceforge.net/report/node4.html

[8] David E. Rumelhart, Bernard Widrow and Michael A. Lehr: The Basic Ideas in Neural Networks, 1994.
[9] Wikipedia: Artificial Neural Network, http://en.wikipedia.org/wiki/Artificial_neural_network