
Direct Optimization for Web Search Ranking
Olivier Chapelle

SIGIR Workshop: Learning to Rank for Information Retrieval, July 23rd 2009


Outline

1. Introduction
2. Continuous approximation
3. Structured output learning
4. Perspectives

Introduction

Web search ranking

Ranking via a relevance function

Given a query q and a document d, estimate the relevance of d to q. Web search results are sorted by relevance. Traditional relevance functions (e.g. BM25) are hand-designed. Recently: several machine learning approaches to learn the relevance function. Learning a relevance function is practical, but there are other possibilities: learn a ranking or learn a preference function.


Machine learning for ranking

Training data:
1. Binary relevance labels (traditional IR).
2. Multiple levels of relevance (Excellent, Good, Bad, ...).
3. Pairwise comparisons.

It is possible to convert 1 and 2 into 3. Human editors are needed for 1 and 2, but 3 can be obtained from large amounts of click data → skip-above pairs in [Joachims '02]. Rest of this talk: 2.


Information retrieval metrics

Binary relevance labels: average precision, reciprocal rank, winner takes all, AUC (i.e. fraction of misranked pairs), ...

Multiple levels of relevance: Discounted Cumulative Gain at rank p,

$$\mathrm{DCG}_p = \sum_{\text{ranks } j} D_p(j)\, G(s_{r(j)}) = \sum_{\text{documents } i} D_p(r^{-1}(i))\, G(s_i),$$

Rank j    D(j)          G(s_{r(j)})
1         1             3
2         1/log2(3)     7
3         1/log2(4)     0
...

where s_i is the relevance score for document i, from 0 (Bad) to 4 (Perfect); r is the ranking function: r(j) = i means document i is at position j; D_p is the discount function truncated at rank p, D(j) = 1/log2(j+1) if j ≤ p, 0 otherwise; G is the gain function, G(s) = 2^s − 1.
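To make the definition concrete, here is a minimal sketch of the DCG_p computation described above (the function name and the numpy dependency are my own choices, not from the talk):

```python
import numpy as np

def dcg_at_p(scores, relevances, p=10):
    """DCG_p of the ranking induced by `scores`.

    scores     : model scores w^T x_i, one per document.
    relevances : editorial labels s_i in {0, ..., 4}.
    p          : truncation rank.
    """
    scores = np.asarray(scores, dtype=float)
    relevances = np.asarray(relevances, dtype=float)
    order = np.argsort(-scores)[:p]            # r(j): document at position j
    ranks = np.arange(1, len(order) + 1)
    discount = 1.0 / np.log2(ranks + 1)        # D(j) = 1 / log2(j + 1)
    gain = 2.0 ** relevances[order] - 1.0      # G(s) = 2^s - 1
    return float(np.sum(discount * gain))

# Example matching the table above: relevances (2, 3, 0) ranked in that order.
print(dcg_at_p([3.0, 2.0, 1.0], [2, 3, 0], p=3))   # 1*3 + 7/log2(3) + 0
```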


Features

Given a query and a document, construct a feature vector x_i with three types of features:
Query only: type of query, query length, ...
Document only: PageRank, length, spam, ...
Query & document: match score, ...

Set of q = 1, ..., Q queries. Set of n triplets (query, document, score), (x_i, s_i), x_i ∈ R^d, s_i ∈ {0, 1, 2, 3, 4}. U_q is the set of indices associated with the q-th query.
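For concreteness, a tiny sketch of how such a dataset might be held in memory, with U_q as index arrays (all names and values are mine, purely illustrative):

```python
import numpy as np

# n (query, document, score) triplets: feature matrix X, labels s, query ids.
X = np.array([[0.2, 1.3], [0.5, 0.1], [0.9, 2.0], [0.4, 0.7]])   # x_i in R^d
s = np.array([3, 0, 4, 1])                                        # s_i in {0, ..., 4}
query_id = np.array([0, 0, 1, 1])                                 # which query each row belongs to

# U_q: indices of the documents associated with the q-th query.
U = {q: np.flatnonzero(query_id == q) for q in np.unique(query_id)}
print(U[0], s[U[0]])   # documents and labels of query 0
```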


Approaches to ranking

Pointwise: classification [Li et al. '07], regression → works surprisingly well.
Pairwise: RankSVM; perceptron [Crammer et al. '03]; neural nets [Burges et al. '05], LambdaRank; boosting: RankBoost, GBRank.
Listwise: non metric-specific: ListNet, ListMLE; metric-specific: AdaRank; structured learning: SVMMAP, [Chapelle et al. '07]; gradient descent: SoftRank, [Chapelle et al. '09].


Two approaches for direct optimization of the DCG:
1. Gradient descent on a smooth approximation of the DCG.
2. Large margin structured output learning where the loss function is the DCG.

Orthogonal issue: choice of the architecture → for simplicity, linear functions. At the end, we will present non-linear extensions.

Continuous approximation

Main difficulty for direct optimization (by gradient descent for instance): the DCG is not continuous and is constant almost everywhere → use a continuous approximation of it:

$$\mathrm{DCG}_1 = \sum_i \mathbf{1}\!\left(i = \arg\max_j\, w^\top x_j\right) G(s_i) \;\approx\; \sum_i \frac{\exp(w^\top x_i/\sigma)}{\sum_j \exp(w^\top x_j/\sigma)}\, G(s_i).$$

→ "Soft-argmax"; softness controlled by σ.
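A minimal numpy sketch of this soft-argmax approximation of DCG_1 (naming is mine; an illustration, not the talk's code):

```python
import numpy as np

def smooth_dcg1(w, X, relevances, sigma=1.0):
    """Soft-argmax approximation of DCG_1 = G(s) of the top-scored document.

    As sigma -> 0 the softmax concentrates on the arg max and the
    approximation tends to the true DCG_1.
    """
    scores = X @ w
    z = scores / sigma
    soft = np.exp(z - z.max())          # numerically stable softmax
    soft /= soft.sum()
    gains = 2.0 ** np.asarray(relevances, dtype=float) - 1.0
    return float(soft @ gains)
```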


Generalization for DCG_p:

$$A(w, \sigma) := \sum_{j=1}^{p} D(j)\, \frac{\sum_i G(s_i)\, h_{ij}}{\sum_i h_{ij}},$$

with h_ij a "smooth" version of the indicator function "Is x_i at the j-th position in the ranking?":

$$h_{ij} = \exp\!\left(-\frac{(w^\top x_i - w^\top x_{r(j)})^2}{2\sigma^2}\right).$$

σ controls the amount of smoothing: when σ → 0, A(w, σ) → DCG_p. A(w, σ) is continuous but non-differentiable; however it is differentiable almost everywhere → no problem for gradient descent. The approach generalizes to other IR metrics such as MAP.
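A sketch of the smoothed metric A(w, σ) following the formula above, with the reference scores w^T x_{r(j)} taken from the hard ranking induced by the current w (all names are my own):

```python
import numpy as np

def smooth_dcg(w, X, relevances, sigma=1.0, p=10):
    """A(w, sigma): smoothed DCG_p with Gaussian soft position indicators h_ij."""
    scores = X @ w
    gains = 2.0 ** np.asarray(relevances, dtype=float) - 1.0
    order = np.argsort(-scores)                # r(j): document at position j
    p = min(p, len(scores))
    discount = 1.0 / np.log2(np.arange(1, p + 1) + 1)
    value = 0.0
    for j in range(p):
        # h_ij = exp(-(w.x_i - w.x_{r(j)})^2 / (2 sigma^2))
        h = np.exp(-((scores - scores[order[j]]) ** 2) / (2.0 * sigma ** 2))
        value += discount[j] * (gains @ h) / h.sum()
    return float(value)
```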


Optimization by gradient descent and annealing:
1. Initialize w = w0 and a large σ.
2. Starting from w, minimize λ||w − w0||² − A(w, σ) by (conjugate) gradient descent.
3. Divide σ by 2 and go back to 2 (or stop).

w0 is an initial solution, such as the one given by pairwise ranking.

[Figure: objective function as a function of a parameter t, for σ = 0.125, 1, 8, 64; larger σ gives a smoother objective.]
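A sketch of the annealing loop, assuming a smooth_dcg(w, X, relevances, sigma, p) function such as the one sketched earlier; scipy's conjugate-gradient minimizer stands in for the talk's (conjugate) gradient descent, and λ, σ0 and the schedule are illustrative values of mine:

```python
import numpy as np
from scipy.optimize import minimize

def anneal_smooth_dcg(X, relevances, w0, smooth_dcg, lam=1e-3,
                      sigma0=64.0, n_anneal=10, p=10):
    """Gradient descent with annealing on the smoothed DCG.

    We *minimize* lam * ||w - w0||^2 - A(w, sigma) and halve sigma after
    each inner optimization, as in the scheme above.
    """
    w0 = np.asarray(w0, dtype=float)
    w, sigma = w0.copy(), sigma0
    for _ in range(n_anneal):
        obj = lambda w_: (lam * np.sum((w_ - w0) ** 2)
                          - smooth_dcg(w_, X, relevances, sigma, p))
        w = minimize(obj, w, method="CG").x    # (conjugate) gradient descent
        sigma /= 2.0
    return w
```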



Evaluation on web search data

Dataset: several hundred features; ∼50k (query, url) pairs from an international market; ∼1500 queries randomly split into training / test (80% / 20%); 5 levels of relevance.

[Figure: DCG_5 as a function of the smoothing factor σ for λ = 10^1, ..., 10^5, on the training set (left) and the test set (right).]

DCG can be improved by almost 10% on the training set (left), but not more than 1% on the test set (right).


Evaluation on Letor 3.0

Ohsumed dataset: [Figure: NDCG at positions 1–10 for SmoothNDCG, RankSVM, Regression, AdaRank-NDCG and ListNet, and NDCG@10 for all methods (SVMMAP, RankBoost, ListNet, FRank, AdaRank-NDCG, AdaRank-MAP, Regression, RankSVM, SmoothNDCG).]

All datasets: [Figure: NDCG@10 and MAP for the same methods.]

Structured output learning

Notations

x_q is the set of documents associated with query q; x_{qi} is the i-th document. y_q is a ranking (i.e. a permutation): y_{qi} is the rank of the i-th document (obtained by sorting the scores s_{qi}).

Learning for structured outputs (Tsochantaridis et al. '04): learn a mapping x → y. Joint feature map: Ψ(x, y). Prediction rule: ŷ = arg max_y w^T Ψ(x, y).

We take

$$\Psi(x_q, y_q) = \sum_i x_{qi}\, A(y_{qi}),$$

where A : N → R is a user-defined non-increasing function. The ranking is given by the order of w^T x_{qi}, because w^T Ψ(x, y) = Σ_i w^T x_{qi} A(y_i): the sum is maximized by pairing the largest score with the largest value of A.

Example with scores w^T x = (2.5, 3.7, −0.5) and A(1) = 3, A(2) = 2, A(3) = 1: the maximizing ranking puts document 2 first, giving 2.5·A(2) + 3.7·A(1) + (−0.5)·A(3) = 2.5·2 + 3.7·3 + (−0.5)·1 = 15.6 → max.

Constraints for correct predictions on the training set:

$$\forall q, \forall y \neq y_q, \quad w^\top \Psi(x_q, y_q) - w^\top \Psi(x_q, y) > 0.$$
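A small sketch of the joint feature map and of the fact that the arg max over rankings is obtained by sorting the scores; A(r) = 4 − r is chosen only to reproduce the A(1) = 3, A(2) = 2, A(3) = 1 of the example above (names are mine):

```python
import numpy as np

def psi(X, ranks, A):
    """Joint feature map Psi(x, y) = sum_i x_i * A(y_i); `ranks[i]` is the
    (1-based) position of document i and A maps a position to a weight."""
    weights = np.array([A(r) for r in ranks], dtype=float)
    return X.T @ weights

def predict_ranking(w, X):
    """arg max_y w^T Psi(x, y): pair the largest score with the largest A
    value, i.e. simply sort documents by decreasing score w^T x_i."""
    scores = X @ w
    order = np.argsort(-scores)            # order[j] = document at position j+1
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

# Toy example matching the slide: scores (2.5, 3.7, -0.5), A(r) = 4 - r.
X = np.eye(3)
w = np.array([2.5, 3.7, -0.5])
ranks = predict_ranking(w, X)              # document 2 is ranked first
print(ranks, w @ psi(X, ranks, lambda r: 4 - r))   # -> [2 1 3] 15.6
```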


SVM-like optimization problem:

$$\min_{w, \xi_q} \; \frac{\lambda}{2} \|w\|^2 + \sum_{q=1}^{Q} \xi_q,$$

under the constraints

$$\forall q, \forall y \neq y_q, \quad w^\top \Psi(x_q, y_q) - w^\top \Psi(x_q, y) \geq \Delta(y, y_q) - \xi_q,$$

where ∆(y, y_q) is the query loss, e.g. the difference between the DCGs of rankings y and y_q. At the optimal solution, ξ_q ≥ ∆(ŷ_q, y_q) with ŷ_q = arg max_y w^T Ψ(x_q, y).


Optimization

$$\xi_q = \max_y \; \Delta(y, y_q) + w^\top \Psi(x_q, y) - w^\top \Psi(x_q, y_q).$$

Need to find the argmax:

$$\tilde y = \arg\max_y \; \sum_i A(y_i)\, w^\top x_{qi} - G(s_{qi})\, D(y_i).$$

→ Can be solved efficiently as a linear assignment problem.
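A hedged sketch of that reduction: build the benefit matrix with entries A(j)·w^T x_i − G(s_i)·D(j) over (document, position) pairs and solve the assignment with scipy (function names and toy values are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def loss_augmented_argmax(scores, relevances, A, D):
    """arg max_y  sum_i [ A(y_i) * scores_i - G(s_i) * D(y_i) ]
    solved as a linear assignment of documents i to positions j."""
    n = len(scores)
    gains = 2.0 ** np.asarray(relevances, dtype=float) - 1.0
    positions = np.arange(1, n + 1)
    a = np.array([A(j) for j in positions], dtype=float)
    d = np.array([D(j) for j in positions], dtype=float)
    # benefit[i, j]: put document i at position j+1
    benefit = np.outer(scores, a) - np.outer(gains, d)
    rows, cols = linear_sum_assignment(benefit, maximize=True)
    ranks = np.empty(n, dtype=int)
    ranks[rows] = positions[cols]          # ranks[i] = assigned position of doc i
    return ranks

scores = np.array([1.2, 0.3, 2.0])
rels = np.array([0, 3, 1])
A = lambda r: max(4 - r, 0)                # illustrative choice of A
D = lambda r: 1.0 / np.log2(r + 1)         # DCG discount
print(loss_augmented_argmax(scores, rels, A, D))
```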


Cutting plane

Strategy used in SVMstruct. Iterate between:
1. Solving the problem on a subset of constraints.
2. Finding and adding the (most) violated constraints.

Unconstrained optimization

$$\min_w \; \frac{1}{2}\|w\|^2 + \sum_q \max_y \; \Delta(y, y_q) + w^\top \Psi(x_q, y) - w^\top \Psi(x_q, y_q).$$

Convex, but not differentiable: subgradient descent, bundle method.
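A compact sketch of the subgradient approach on this unconstrained objective: for each query, compute the loss-augmented ranking ỹ_q by linear assignment (as above) and step along λw + Σ_q (Ψ(x_q, ỹ_q) − Ψ(x_q, y_q)). The loop, step size and choices of A and D are illustrative, not from the talk:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def G(s):
    return 2.0 ** np.asarray(s, dtype=float) - 1.0   # gain G(s) = 2^s - 1

def subgradient_descent(queries, A, D, lam=1.0, lr=1e-3, n_iter=200):
    """Subgradient descent on
        lam/2 ||w||^2 + sum_q max_y [ Delta(y, y_q) + w.Psi(x_q, y) - w.Psi(x_q, y_q) ].
    `queries` is a list of (X, s) pairs (feature matrix, relevance labels)."""
    w = np.zeros(queries[0][0].shape[1])
    for _ in range(n_iter):
        grad = lam * w
        for X, s in queries:
            n = X.shape[0]
            pos = np.arange(1, n + 1)
            a = np.array([A(j) for j in pos], dtype=float)
            d = np.array([D(j) for j in pos], dtype=float)
            scores = X @ w
            # Loss-augmented argmax as a linear assignment (documents x positions).
            benefit = np.outer(scores, a) - np.outer(G(s), d)
            rows, cols = linear_sum_assignment(benefit, maximize=True)
            y_tilde = np.empty(n, dtype=int)
            y_tilde[rows] = pos[cols]
            # True ranking y_q: sort documents by decreasing relevance.
            y_true = np.empty(n, dtype=int)
            y_true[np.argsort(-np.asarray(s, dtype=float))] = pos
            # Subgradient of the max term: Psi(x_q, y~) - Psi(x_q, y_q).
            grad += X.T @ (a[y_tilde - 1] - a[y_true - 1])
        w -= lr * grad
    return w
```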


Experiments

Normalized DCG for different truncation levels k. λ chosen on a validation set; A(r) = max(k + 1 − r, 0); training time ∼20 minutes.

[Figure: NDCG_k for k = 2, ..., 10, comparing SVMStruct, Regression and RankSVM.]

About 2% improvement (p-value = 0.03 vs regression, 0.07 vs RankSVM).


Ohsumed dataset (Letor distribution): 3 levels of relevance, 25 features, 106 queries split into training / validation / test.

Optimal solution is w = 0 even for small values of λ. Reason: there are a lot of constraints and not a lot of variables → the objective looks like x → |x|.

[Figure: values of w^T Ψ(x, ·) for three rankings labeled Perfect, Bad and Good.]

Large loss (because Perfect < Bad), but we would like a small one (because Good is at the top).


The standard structured hinge loss is

$$\max_y \; w^\top \Psi(x_i, y) - w^\top \Psi(x_i, y_i) + \Delta(y, y_i).$$

Idea: replace the true ranking y_i by a ranking ŷ with no loss,

$$\min_{\hat y,\, \Delta(\hat y, y_i) = 0} \; \max_y \; w^\top \Psi(x_i, y) - w^\top \Psi(x_i, \hat y) + \Delta(y, \hat y),$$

or, more generally, penalize ŷ through its own loss:

$$\min_{\hat y} \max_y \; w^\top \Psi(x_i, y) - w^\top \Psi(x_i, \hat y) + \Delta(y, \hat y) + \Delta(\hat y, y_i)$$
$$= \min_{\hat y} \max_y \; w^\top \Psi(x_i, y) - w^\top \Psi(x_i, \hat y) + \Delta(y, y_i)$$
$$= \max_y \left( w^\top \Psi(x_i, y) + \Delta(y, y_i) \right) - \max_{\hat y} \; w^\top \Psi(x_i, \hat y),$$

where the first equality uses that ∆ is a difference of DCGs, so ∆(y, ŷ) + ∆(ŷ, y_i) = ∆(y, y_i).

1. Smaller than the original loss: take ŷ = y_i.
2. Still an upper bound on the loss: take y = ŷ.
3. Non-convex.

→ This upper bound can be used for any structured output learning problem. Details available in "Tighter bounds for structured estimation" [Do et al. '09] and "Optimization of ranking measures" [Le et al.].


Ohsumed dataset: w0, found by regression, serves as a starting point and is used in the regularizer ||w − w0||². Optimization for DCG_10.

[Figure: NDCG_k for k = 2, ..., 10, comparing SVMStruct, RankBoost and RankSVM.]


Non-linear extensions

1. The "obvious" kernel trick.
2. Gradient boosted decision trees: Friedman's functional gradient boosting framework,

$$\min_{f \in \mathcal{F}} \; R(f) = \sum_{j=1}^{n} \ell(y_j, f(x_j)), \qquad \mathcal{F} = \left\{ \sum_{i=1}^{N} \alpha_i h_i \right\}.$$

Typically h_i is a tree and N is infinite → cannot do direct optimization over F.

f ← 0
repeat
    g_j ← −∂ℓ(f(x_j), y_j) / ∂f(x_j)                  (functional gradient)
    î ← arg min_{i,λ} Σ_j (g_j − λ h_i(x_j))²          (steepest component)
    ρ̂ ← arg min_ρ R(f + ρ h_î)                         (line search)
    f ← f + η ρ̂ h_î                                    ("shrinkage" when η < 1)
until max iterations reached
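A hedged Python sketch of this boosting loop, with small scikit-learn regression trees as the base learners and squared loss for concreteness; the line search is replaced by the shrinkage step, and all names and hyper-parameters are mine:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss_grad, n_rounds=100, eta=0.1, max_depth=3):
    """Friedman-style functional gradient boosting (sketch).

    loss_grad(f, y) returns dl(y_j, f(x_j))/df(x_j) for every training point.
    Fitting a shallow tree to the negative gradient plays the role of the
    'steepest component' search; eta is the shrinkage factor (line search omitted).
    """
    f, trees = np.zeros(len(y), dtype=float), []
    for _ in range(n_rounds):
        g = -loss_grad(f, y)                              # functional gradient
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)
        f += eta * tree.predict(X)                        # shrinkage step
        trees.append(tree)
    return trees

def boosted_predict(trees, X, eta=0.1):
    return eta * sum(t.predict(X) for t in trees)

# Squared loss l(y, f) = (y - f)^2 / 2  =>  dl/df = f - y.
squared_loss_grad = lambda f, y: f - np.asarray(y, dtype=float)
```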


The objective function can be much more general: R(f(x_1), ..., f(x_n)) with g_j = −∂R/∂f(x_j). For ranking via structured output learning:

$$R(f) = \sum_q \max_y \left( \Delta(y, y_q) + \sum_i f(x_{qi})\, (A(y_i) - A(y_{qi})) \right).$$

Preliminary results are disappointing: with gradient boosted decision trees, there is no difference between regression and structured output learning. → This could be because the loss function matters only when the class of functions is restricted (underfitting).


Perspectives

Choice of the objective function

General consensus on the relative performance of learning-to-rank methods: Pointwise < Pairwise < Listwise. True, but the differences are very small. On real web search data, using non-linear functions: pairwise is ∼0.5–1% better than pointwise; listwise is ∼0–0.5% better than pairwise. Letor datasets are interesting to test some ideas, but validation in a real setting is necessary.


Public web search datasets

Internet Mathematics 2009: dataset released by the Russian search engine Yandex for a competition, available at http://company.yandex.ru/grant/2009/en/datasets. 9,124 queries / 97,290 judgements (training); 245 features; 5 levels of relevance; 132 submissions.

Yahoo! also plans to organize a similar competition and release datasets. Stay tuned!


To improve a ranking system, work in priority on:
1. Feature development.
2. Choice of the function class.
3. Choice of the objective function to optimize.

But 1 and 2 are orthogonal issues to learning to rank. What are the other interesting problems beyond the choice of the objective function?


Sample selection bias

Training and offline test sets typically come from pooling the top results of other ranking functions, but online test documents come from a "larger" distribution (all the documents from a simple ranking function). Problem: the learning algorithm does not learn to demote very bad pages (low BM25, spam, ...) because they rarely appear in the training set. Solution: reweight the training set so that it resembles the online test distribution.


Diversity

Output a set of relevant documents which is also diverse. Need to go beyond learning a relevance function; structured output learning can be a principled framework for this purpose, but in any case there is extra computational load at test time. Problem: no cheap metric for diversity. Diversity of content is more important than diversity of topic: a user can always reformulate an ambiguous query.


Transfer / multi-task learning

How to leverage the data from one (big) market to another (small) one?


Cascade learning

Ideally: rank all existing web pages. In practice: rank only a small subset of them using machine learning. Instead: build a "cascade" of T rankers f_1, ..., f_T. All documents are fed to f_1; the bottom documents are discarded after each round; features and functions are of increasing complexity; each ranker is learned.


Low-level learning

Two different ML philosophies:
1. Design a limited number of high-level features and put an ML algorithm on top of them.
2. Let the ML algorithm work directly on a large number of low-level features.

We have done 1, but 2 has been successful in various domains such as computer vision. Two ideas:
Learn BM25 by introducing several parameters per word (such as the k in the saturation function).
Define the match score as Σ_{i,j} w_ij q_i d_j and learn the w_ij; see the earlier talk "Learning to rank with low rank", and the sketch below.
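For the second idea, the match score Σ_{i,j} w_ij q_i d_j is simply a bilinear form q^T W d between the query and document term vectors; a tiny sketch (all names and values are illustrative):

```python
import numpy as np

def match_score(q, d, W):
    """Low-level match score sum_{i,j} W[i, j] * q[i] * d[j] = q^T W d,
    where q and d are term-weight vectors and W is learned."""
    return float(q @ W @ d)

rng = np.random.default_rng(0)
vocab = 5
W = rng.normal(size=(vocab, vocab))     # learned word-word weights
q = np.array([1.0, 0, 0, 1.0, 0])       # query term vector
d = np.array([0, 2.0, 0, 1.0, 0.5])     # document term vector
print(match_score(q, d, W))
```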


Summary

Optimizing ranking measures is difficult, but feasible. Two types of approaches: convex upper bound or non-convex approximation. Only small improvements in real settings (large number of examples, large number of features, non-linear architecture) → the choice of the objective function has a small influence on overall performance. Research on learning to rank should focus on new problems.