Understanding Machine Learning
Solution Manual

Written by Alon Gonen*

*The solutions to Chapters 13 and 14 were written by Shai Shalev-Shwartz.

Edited by Dana Rubinstein

November 17, 2014

2 Gentle Start

1. Given S = ((x_i, y_i))_{i=1}^m, define the multivariate polynomial

p_S(x) = −∏_{i∈[m] : y_i=1} ‖x − x_i‖².

Then, for every i s.t. y_i = 1 we have p_S(x_i) = 0, while for every other x we have p_S(x) < 0.

2. By the linearity of expectation,

E_{S|_x∼D^m}[L_S(h)] = E_{S|_x∼D^m}[(1/m) ∑_{i=1}^m 1_{[h(x_i)≠f(x_i)]}]
  = (1/m) ∑_{i=1}^m E_{x_i∼D}[1_{[h(x_i)≠f(x_i)]}]
  = (1/m) ∑_{i=1}^m P_{x_i∼D}[h(x_i) ≠ f(x_i)]
  = (1/m) · m · L_(D,f)(h)
  = L_(D,f)(h).
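As an illustration of the predictor from Exercise 1, here is a minimal sketch (ours, not part of the original solution) that builds p_S from a training set of NumPy-compatible vectors and labels x positively iff p_S(x) ≥ 0; the function names are illustrative.

    import numpy as np

    def make_polynomial_predictor(S):
        """Build p_S(x) = -prod_{i: y_i=1} ||x - x_i||^2 and the induced
        classifier h(x) = 1 iff p_S(x) >= 0."""
        positives = [np.asarray(x, dtype=float) for x, y in S if y == 1]

        def p_S(x):
            x = np.asarray(x, dtype=float)
            return -np.prod([np.sum((x - xi) ** 2) for xi in positives])

        def h(x):
            # p_S vanishes exactly on the positive training points and is negative elsewhere
            return 1 if p_S(x) >= 0 else 0

        return p_S, h

    # Toy usage
    S = [((0.0, 0.0), 1), ((1.0, 1.0), 1), ((3.0, 0.5), 0)]
    p_S, h = make_polynomial_predictor(S)
    print(h((0.0, 0.0)), h((1.0, 1.0)), h((3.0, 0.5)))  # 1 1 0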

3. (a) First, observe that by definition, A labels positively all the positive instances in the training set. Second, as we assume realizability, and since the tightest rectangle enclosing all positive examples is returned, all the negative instances are labeled correctly by A as well. We conclude that A is an ERM.

(b) Fix some distribution D over X, and define R* as in the hint. Let f be the hypothesis associated with R*. Given a training set S, denote by R(S) the rectangle returned by the proposed algorithm and by A(S) the corresponding hypothesis. The definition of the algorithm A implies that R(S) ⊆ R* for every S. Thus,

L_(D,f)(A(S)) = D(R* \ R(S)).

Fix some ε ∈ (0,1). Define R1, R2, R3 and R4 as in the hint. For each i ∈ [4], define the event

F_i = {S|_x : S|_x ∩ R_i = ∅}.

Applying the union bound, we obtain

D^m({S : L_(D,f)(A(S)) > ε}) ≤ D^m(∪_{i=1}^4 F_i) ≤ ∑_{i=1}^4 D^m(F_i).

Thus, it suffices to ensure that D^m(F_i) ≤ δ/4 for every i. Fix some i ∈ [4]. Then, the probability that a sample is in F_i is the probability that none of the instances falls in R_i, which is exactly (1 − ε/4)^m. Therefore,

D^m(F_i) = (1 − ε/4)^m ≤ exp(−εm/4),

and hence,

D^m({S : L_(D,f)(A(S)) > ε}) ≤ 4 exp(−εm/4).

Plugging in the assumption on m, we conclude our proof.

(c) The hypothesis class of axis-aligned rectangles in R^d is defined as follows. Given real numbers a_1 ≤ b_1, a_2 ≤ b_2, ..., a_d ≤ b_d, define the classifier h_(a_1,b_1,...,a_d,b_d) by

h_(a_1,b_1,...,a_d,b_d)(x_1,...,x_d) = 1 if a_i ≤ x_i ≤ b_i for every i ∈ [d], and 0 otherwise.    (1)

The class of all axis-aligned rectangles in R^d is defined as

H^d_rec = {h_(a_1,b_1,...,a_d,b_d) : ∀i ∈ [d], a_i ≤ b_i}.

It can be seen that the same algorithm proposed above is an ERM for this case as well. The sample complexity is analyzed similarly. The only difference is that instead of 4 strips, we have 2d strips (2 strips for each dimension). Thus, it suffices to draw a training set of size ⌈2d log(2d/δ)/ε⌉.

(d) For each dimension, the algorithm has to find the minimal and the maximal values among the positive instances in the training sequence. Therefore, its runtime is O(md). Since we have shown that the required value of m is at most ⌈2d log(2d/δ)/ε⌉, it follows that the runtime of the algorithm is indeed polynomial in d, 1/ε, and log(1/δ).
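The following sketch (ours, not from the text) implements the d-dimensional tightest-rectangle ERM of Exercise 3 with NumPy: per-dimension minima and maxima of the positive examples, in O(md) time, together with a helper that evaluates the sample-size bound ⌈2d log(2d/δ)/ε⌉ derived above.

    import math
    import numpy as np

    def tightest_rectangle(X, y):
        """ERM for axis-aligned rectangles in R^d: the per-dimension minima and
        maxima of the positive examples; runs in O(m*d) time."""
        pos = X[y == 1]
        if len(pos) == 0:
            return None  # no positive examples: predict all-negative
        return pos.min(axis=0), pos.max(axis=0)

    def predict(rect, x):
        if rect is None:
            return 0
        a, b = rect
        return int(np.all(a <= x) and np.all(x <= b))

    def sample_size(d, eps, delta):
        """The bound ceil(2d*log(2d/delta)/eps) derived above."""
        return math.ceil(2 * d * math.log(2 * d / delta) / eps)

    # Toy usage: learn a rectangle in R^2
    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 2))
    y = ((np.abs(X[:, 0]) <= 1) & (np.abs(X[:, 1]) <= 1)).astype(int)
    rect = tightest_rectangle(X, y)
    print(sample_size(d=2, eps=0.1, delta=0.05),
          predict(rect, np.array([0.5, 0.5])), predict(rect, np.array([1.5, 0.0])))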

3 A Formal Learning Model

1. The proofs follow (almost) immediately from the definition. We will show that the sample complexity is monotonically decreasing in the accuracy parameter ε. The proof that the sample complexity is monotonically decreasing in the confidence parameter δ is analogous.

Denote by D an unknown distribution over X, and let f ∈ H be the target hypothesis. Denote by A an algorithm which learns H with sample complexity m_H(·,·). Fix some δ ∈ (0,1). Suppose that 0 < ε_1 ≤ ε_2 ≤ 1. We need to show that m_1 := m_H(ε_1, δ) ≥ m_H(ε_2, δ) =: m_2. Given an i.i.d. training sequence of size m ≥ m_1, we have that with probability at least 1 − δ, A returns a hypothesis h such that

L_(D,f)(h) ≤ ε_1 ≤ ε_2.

By the minimality of m_2, we conclude that m_2 ≤ m_1.

2. (a) We propose the following algorithm. If a positive instance x+ appears in S, return the (true) hypothesis h_{x+}. If S doesn't contain any positive instance, the algorithm returns the all-negative hypothesis. It is clear that this algorithm is an ERM.

(b) Let ε ∈ (0,1), and fix the distribution D over X. If the true hypothesis is the all-negative hypothesis h−, then our algorithm returns a perfect hypothesis.

Assume now that there exists a unique positive instance x+. It is clear that if x+ appears in the training sequence S, our algorithm returns a perfect hypothesis. Furthermore, if D[{x+}] ≤ ε then in any case the returned hypothesis has a generalization error of at most ε (with probability 1). Thus, it is only left to bound the probability of the case in which D[{x+}] > ε, but x+ doesn't appear in S. Denote this event by F. Then

P_{S|_x∼D^m}[F] ≤ (1 − ε)^m ≤ e^{−εm}.

Hence, H_Singleton is PAC learnable, and its sample complexity is bounded by

m_H(ε,δ) ≤ ⌈log(1/δ)/ε⌉.
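A minimal sketch (ours) of the ERM for H_Singleton described in part (a): return h_{x+} if some positive instance appears in S, and the all-negative hypothesis otherwise.

    def learn_singleton(S):
        """ERM for H_Singleton: if some positive instance x+ appears in S,
        return h_{x+}; otherwise return the all-negative hypothesis."""
        positives = [x for x, y in S if y == 1]
        if positives:
            x_plus = positives[0]  # under realizability all positives are the same point
            return lambda x: 1 if x == x_plus else 0
        return lambda x: 0  # all-negative hypothesis

    # Toy usage: the unique positive instance is 3
    S = [(1, 0), (3, 1), (7, 0)]
    h = learn_singleton(S)
    print(h(3), h(7))  # 1 0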

3. Consider the ERM algorithm A which, given a training sequence S = ((x_i, y_i))_{i=1}^m, returns the hypothesis ĥ corresponding to the "tightest" circle which contains all the positive instances. Denote the radius of this hypothesis by r̂. Assume realizability and let h* be a circle with zero generalization error. Denote its radius by r*.

Let ε, δ ∈ (0,1). Let r̄ ≤ r* be a scalar s.t. D_X({x : r̄ ≤ ‖x‖ ≤ r*}) = ε. Define E = {x ∈ R² : r̄ ≤ ‖x‖ ≤ r*}. The probability (over drawing S) that L_D(h_S) ≥ ε is bounded above by the probability that no point in S belongs to E. The probability of this event is bounded above by

(1 − ε)^m ≤ e^{−εm}.

The desired bound on the sample complexity follows by requiring that e^{−εm} ≤ δ.
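A short sketch (ours) of this circle ERM, under the assumption, implicit in the use of a single radius above, that all circles are centered at the origin: the returned hypothesis labels x positively iff ‖x‖ ≤ r̂, where r̂ is the largest norm of a positive training example.

    import math

    def tightest_circle(S):
        """ERM for origin-centered circles: r_hat is the largest norm of a positive example."""
        radii = [math.hypot(*x) for x, y in S if y == 1]
        r_hat = max(radii) if radii else 0.0  # no positives: degenerate radius 0
        return lambda x: 1 if math.hypot(*x) <= r_hat else 0

    # Toy usage
    S = [((0.5, 0.5), 1), ((1.0, 0.0), 1), ((2.0, 2.0), 0)]
    h = tightest_circle(S)
    print(h((0.2, 0.3)), h((2.0, 0.1)))  # 1 0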

4. We first observe that H is finite. Let us calculate its size exactly. Each hypothesis, besides the all-negative hypothesis, is determined by deciding, for each variable x_i, whether x_i, x̄_i, or neither of them appears in the corresponding conjunction. Thus, |H| = 3^d + 1. We conclude that H is PAC learnable and its sample complexity can be bounded by

m_H(ε,δ) ≤ ⌈(d log 3 + log(1/δ))/ε⌉.

Let us describe our learning algorithm. We define h_0 = x_1 ∧ x̄_1 ∧ ··· ∧ x_d ∧ x̄_d. Observe that h_0 is the always-minus hypothesis. Let ((a^1, y_1), ..., (a^m, y_m)) be an i.i.d. training sequence of size m.


Since we cannot extract any information from negative examples, our algorithm ignores them. Given a positive example a = (a_1,...,a_d), we remove from the current hypothesis every literal that is falsified by a: if a_i = 1, we remove x̄_i, and if a_i = 0, we remove x_i. Denote by h_t the hypothesis obtained after processing the first t examples. Finally, our algorithm returns h_m.

By construction and realizability, h_t labels positively all the positive examples among a^1,...,a^t. For the same reasons, the set of literals in h_t contains the set of literals in the target hypothesis. Thus, h_t classifies correctly the negative examples among a^1,...,a^t. This implies that h_m is an ERM.

Since the algorithm takes linear time (in terms of the dimension d) to process each example, the running time is bounded by O(md).
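A sketch (ours) of the conjunction learner just described, with Boolean examples given as 0/1 tuples; literals are encoded as signed 1-based indices, +i for x_i and −i for x̄_i.

    def learn_conjunction(examples, d):
        """Start from h0 = x1 & ~x1 & ... & xd & ~xd and, for each positive
        example, drop every literal it falsifies. Runs in O(m*d) time."""
        # literal +i stands for x_i, literal -i stands for the negation of x_i (1-based)
        literals = set(range(1, d + 1)) | set(-i for i in range(1, d + 1))
        for a, y in examples:
            if y != 1:
                continue  # negative examples carry no information for this algorithm
            for i in range(1, d + 1):
                if a[i - 1] == 1:
                    literals.discard(-i)  # ~x_i is falsified by a
                else:
                    literals.discard(i)   # x_i is falsified by a

        def h(x):
            return int(all((x[i - 1] == 1) if i > 0 else (x[-i - 1] == 0) for i in literals))

        return h, literals

    # Toy usage: target conjunction is x1 & ~x3 over d = 3 variables
    train = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0)]
    h, lits = learn_conjunction(train, d=3)
    print(sorted(lits), h((1, 0, 0)), h((0, 1, 0)))  # [-3, 1] 1 0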

5. Fix some h ∈ H with L_(D̄_m,f)(h) > ε. By definition,

(P_{X∼D_1}[h(X) = f(X)] + ··· + P_{X∼D_m}[h(X) = f(X)]) / m < 1 − ε.

We now bound the probability that h is consistent with S (i.e., that L_S(h) = 0) as follows:

P_{S∼∏_{i=1}^m D_i}[L_S(h) = 0] = ∏_{i=1}^m P_{X∼D_i}[h(X) = f(X)]
  = ( (∏_{i=1}^m P_{X∼D_i}[h(X) = f(X)])^{1/m} )^m
  ≤ ( (∑_{i=1}^m P_{X∼D_i}[h(X) = f(X)]) / m )^m
  < (1 − ε)^m ≤ e^{−εm}.

The first inequality is the geometric-arithmetic mean inequality. Applying the union bound, we conclude that the probability that there exists some h ∈ H with L_(D̄_m,f)(h) > ε which is consistent with S is at most |H| exp(−εm).

6. Suppose that H is agnostic PAC learnable, and let A be a learning algorithm that learns H with sample complexity m_H(·,·). We show that H is PAC learnable using A.


Let D and f be an (unknown) distribution over X and the target function, respectively. We may assume w.l.o.g. that D is a joint distribution over X × {0,1}, where the conditional probability of y given x is determined deterministically by f. Since we assume realizability, we have inf_{h∈H} L_D(h) = 0. Let ε, δ ∈ (0,1). Then, for every positive integer m ≥ m_H(ε,δ), if we equip A with a training set S consisting of m i.i.d. instances which are labeled by f, then with probability at least 1 − δ (over the choice of S|_x), it returns a hypothesis h with

L_D(h) ≤ inf_{h'∈H} L_D(h') + ε = 0 + ε = ε.

7. Let x ∈ X, and let α_x denote the conditional probability of a positive label given x. We have

P[f_D(X) ≠ Y | X = x] = 1_{[α_x ≥ 1/2]} · P[Y = 0 | X = x] + 1_{[α_x < 1/2]} · P[Y = 1 | X = x]
  = 1_{[α_x ≥ 1/2]} · (1 − α_x) + 1_{[α_x < 1/2]} · α_x
  = min{α_x, 1 − α_x}.

Let g be a classifier¹ from X to {0,1}. We have

P[g(X) ≠ Y | X = x] = P[g(X) = 0 | X = x] · P[Y = 1 | X = x] + P[g(X) = 1 | X = x] · P[Y = 0 | X = x]
  = P[g(X) = 0 | X = x] · α_x + P[g(X) = 1 | X = x] · (1 − α_x)
  ≥ P[g(X) = 0 | X = x] · min{α_x, 1 − α_x} + P[g(X) = 1 | X = x] · min{α_x, 1 − α_x}
  = min{α_x, 1 − α_x}.

The statement now follows from the fact that the above holds for every x ∈ X. More formally, by the law of total expectation,

L_D(f_D) = E_{(x,y)∼D}[1_{[f_D(x)≠y]}]
  = E_{x∼D_X}[ E_{y∼D_{Y|x}}[1_{[f_D(x)≠y]} | X = x] ]
  = E_{x∼D_X}[ min{α_x, 1 − α_x} ]
  ≤ E_{x∼D_X}[ E_{y∼D_{Y|x}}[1_{[g(x)≠y]} | X = x] ]
  = L_D(g).

¹ g might be non-deterministic.
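A small numeric illustration (ours) of Exercise 7 on a finite domain with assumed known conditional probabilities α_x: it compares the error of the Bayes predictor f_D(x) = 1_{[α_x ≥ 1/2]} with that of an arbitrary competing classifier g.

    # alpha[x] = P[Y=1 | X=x], px[x] = marginal probability of x (assumed known for the illustration)
    alpha = {"a": 0.9, "b": 0.3, "c": 0.5}
    px = {"a": 0.5, "b": 0.3, "c": 0.2}

    def bayes(x):
        return 1 if alpha[x] >= 0.5 else 0

    def error(g):
        # L_D(g) = E_x[ P[g(x) != Y | X = x] ]
        return sum(px[x] * (1 - alpha[x] if g(x) == 1 else alpha[x]) for x in px)

    g = lambda x: 1  # an arbitrary competing classifier
    print(round(error(bayes), 3), round(error(g), 3))  # the Bayes error is the smaller one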

8. (a) This was proved in the previous exercise.

(b) We proved in the previous exercise that for every distribution D, the Bayes optimal predictor f_D is optimal w.r.t. D.

(c) Choose any distribution D. Then A is not better than f_D w.r.t. D.

9. (a) Suppose that H is PAC learnable in the one-oracle model. Let A be an algorithm which learns H and denote by m_H the function that determines its sample complexity. We prove that H is PAC learnable also in the two-oracle model.

Let D be a distribution over X × {0,1}. Note that drawing points from the negative and positive oracles with equal probability is equivalent to obtaining i.i.d. examples from a distribution D′ which gives equal probability to positive and negative examples. Formally, for every subset E ⊆ X we have

D′[E] = (1/2) D+[E] + (1/2) D−[E].

Thus, D′[{x : f(x) = 1}] = D′[{x : f(x) = 0}] = 1/2. If we give A access to a training set which is drawn i.i.d. according to D′ with size m_H(ε/2, δ), then with probability at least 1 − δ, A returns h with

ε/2 ≥ L_(D′,f)(h) = P_{x∼D′}[h(x) ≠ f(x)]
  = P_{x∼D′}[f(x) = 1, h(x) = 0] + P_{x∼D′}[f(x) = 0, h(x) = 1]
  = P_{x∼D′}[f(x) = 1] · P_{x∼D′}[h(x) = 0 | f(x) = 1] + P_{x∼D′}[f(x) = 0] · P_{x∼D′}[h(x) = 1 | f(x) = 0]
  = P_{x∼D′}[f(x) = 1] · P_{x∼D+}[h(x) = 0] + P_{x∼D′}[f(x) = 0] · P_{x∼D−}[h(x) = 1]
  = (1/2) · L_(D+,f)(h) + (1/2) · L_(D−,f)(h).

This implies that with probability at least 1 − δ, both

L_(D+,f)(h) ≤ ε and L_(D−,f)(h) ≤ ε.

Our definition of PAC learnability in the two-oracle model is therefore satisfied, and we can bound both m+_H(ε,δ) and m−_H(ε,δ) by m_H(ε/2, δ).

(b) Suppose that H is PAC learnable in the two-oracle model, and let A be an algorithm which learns H. We show that H is PAC learnable also in the standard model.

Let D be a distribution over X, and denote the target hypothesis by f. Let α = D[{x : f(x) = 1}]. Let ε, δ ∈ (0,1). According to our assumptions, there exist m+ := m+_H(ε, δ/2) and m− := m−_H(ε, δ/2) s.t. if we equip A with m+ examples drawn i.i.d. from D+ and m− examples drawn i.i.d. from D−, then, with probability at least 1 − δ/2, A will return h with

L_(D+,f)(h) ≤ ε ∧ L_(D−,f)(h) ≤ ε.

Our algorithm B draws m = max{2m+/ε, 2m−/ε, 8 log(4/δ)/ε} samples according to D. If there are fewer than m+ positive examples, B returns h−. Otherwise, if there are fewer than m− negative examples, B returns h+. Otherwise, B runs A on the sample and returns the hypothesis returned by A.

First we observe that if the sample contains m+ positive instances and m− negative instances, then the reduction to the two-oracle model works well. More precisely, with probability at least 1 − δ/2, A returns h with

L_(D+,f)(h) ≤ ε ∧ L_(D−,f)(h) ≤ ε.

Hence, with probability at least 1 − δ/2, the algorithm B returns (the same) h with

L_(D,f)(h) = α · L_(D+,f)(h) + (1 − α) · L_(D−,f)(h) ≤ ε.

We consider now the following cases:

• Assume that both α ≥ ε and 1 − α ≥ ε. We show that with probability at least 1 − δ/4, the sample contains at least m+ positive instances. For each i ∈ [m], define the indicator random variable Z_i, which gets the value 1 iff the i-th element in the sample is positive. Define Z = ∑_{i=1}^m Z_i to be the number of positive examples that were drawn. Clearly, E[Z] = αm. Using Chernoff's bound, we obtain

P[Z < (1 − 1/2)αm] < e^{−mα/8}.

By the way we chose m, we conclude that

P[Z < m+] < δ/4.

(Indeed, since α ≥ ε and m ≥ max{2m+/ε, 8 log(4/δ)/ε}, we have αm/2 ≥ εm/2 ≥ m+ and e^{−mα/8} ≤ e^{−mε/8} ≤ δ/4.) Similarly, if 1 − α ≥ ε, the probability that fewer than m− negative examples were drawn is at most δ/4. If both α ≥ ε and 1 − α ≥ ε, then, by the union bound, with probability at least 1 − δ/2, the training set contains at least m+ positive and m− negative instances, respectively. As we mentioned above, if this is the case, the reduction to the two-oracle model works with probability at least 1 − δ/2. The desired conclusion follows by applying the union bound.

• Assume that α < ε, and fewer than m+ positive examples are drawn. In this case, B returns the hypothesis h−. We obtain

L_D(h−) = α < ε.

Similarly, if (1 − α) < ε, and fewer than m− negative examples are drawn, B returns h+. In this case,

L_D(h+) = 1 − α < ε.

All in all, we have shown that with probability at least 1 − δ, B returns a hypothesis h with L_(D,f)(h) < ε. This satisfies our definition of PAC learnability in the one-oracle model.
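A sketch (ours) of the reduction B from part (b), where the two-oracle learner A is passed in as a callable taking the positive and negative sub-samples; the sampling interface and the toy learner are illustrative assumptions, not part of the solution.

    import math
    import random

    def algorithm_B(sample_from_D, A, m_plus, m_minus, eps, delta):
        """Reduction from the two-oracle PAC model to the standard model.
        sample_from_D() returns one labeled pair (x, y) drawn from D."""
        m = max(math.ceil(2 * m_plus / eps),
                math.ceil(2 * m_minus / eps),
                math.ceil(8 * math.log(4 / delta) / eps))
        S = [sample_from_D() for _ in range(m)]
        positives = [x for x, y in S if y == 1]
        negatives = [x for x, y in S if y == 0]
        if len(positives) < m_plus:
            return lambda x: 0  # h-: the all-negative hypothesis
        if len(negatives) < m_minus:
            return lambda x: 1  # h+: the all-positive hypothesis
        # enough examples of both kinds: feed A as if they came from D+ and D-
        return A(positives[:m_plus], negatives[:m_minus])

    # Toy usage: D is uniform on {0,...,9} and f(x) = 1 iff x >= 7
    random.seed(0)

    def sample_from_D():
        x = random.randrange(10)
        return x, int(x >= 7)

    def toy_two_oracle_learner(pos, neg):
        t = min(pos)  # crude threshold learner standing in for A
        return lambda x: int(x >= t)

    h = algorithm_B(sample_from_D, toy_two_oracle_learner, m_plus=5, m_minus=5, eps=0.1, delta=0.1)
    print(h(9), h(0))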

4 Learning via Uniform Convergence

1. (a) Assume that for every ε, δ ∈ (0,1) and every distribution D over X × {0,1}, there exists m(ε,δ) ∈ N such that for every m ≥ m(ε,δ),

P_{S∼D^m}[L_D(A(S)) > ε] < δ.

Let λ > 0. We need to show that there exists m_0 ∈ N such that for every m ≥ m_0, E_{S∼D^m}[L_D(A(S))] ≤ λ. Let ε = min(1/2, λ/2) and set m_0 = m(ε, ε). For every m ≥ m_0, since the loss is bounded above by 1, we have

E_{S∼D^m}[L_D(A(S))] ≤ P_{S∼D^m}[L_D(A(S)) > λ/2] · 1 + P_{S∼D^m}[L_D(A(S)) ≤ λ/2] · (λ/2)
  ≤ P_{S∼D^m}[L_D(A(S)) > ε] + λ/2
  ≤ ε + λ/2
  ≤ λ/2 + λ/2 = λ.

(b) Assume now that

lim_{m→∞} E_{S∼D^m}[L_D(A(S))] = 0.

Let ε, δ ∈ (0,1). There exists some m_0 ∈ N such that for every m ≥ m_0, E_{S∼D^m}[L_D(A(S))] ≤ ε · δ. By Markov's inequality,

P_{S∼D^m}[L_D(A(S)) > ε] ≤ E_{S∼D^m}[L_D(A(S))] / ε ≤ εδ/ε = δ.

2. The left inequality follows from Corollary 4.4. We prove the right inequality. Fix some h ∈ H. Applying Hoeffding's inequality, we obtain

P_{S∼D^m}[|L_D(h) − L_S(h)| ≥ ε/2] ≤ 2 exp(−2m(ε/2)² / (b − a)²).    (2)

The desired inequality is obtained by requiring that the right-hand side of Equation (2) is at most δ/|H|, and then applying the union bound.
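With the additional assumption that the loss takes values in an interval of length b − a = 1, requiring the right-hand side of Equation (2) to be at most δ/|H| yields m ≥ 2 log(2|H|/δ)/ε²; the helper below (ours) just evaluates this bound.

    import math

    def agnostic_sample_complexity(H_size, eps, delta, loss_range=1.0):
        """Smallest m with 2*exp(-2*m*(eps/2)**2 / loss_range**2) <= delta / H_size,
        i.e. m >= 2 * loss_range**2 * log(2*H_size/delta) / eps**2."""
        return math.ceil(2 * loss_range ** 2 * math.log(2 * H_size / delta) / eps ** 2)

    print(agnostic_sample_complexity(H_size=1000, eps=0.1, delta=0.05))  # 2120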

5 The Bias-Complexity Tradeoff

1. We simply follow the hint. By Lemma B.1,

P_{S∼D^m}[L_D(A(S)) ≥ 1/8] = P_{S∼D^m}[L_D(A(S)) ≥ 1 − 7/8]
  ≥ (E_{S∼D^m}[L_D(A(S))] − (1 − 7/8)) / (7/8)
  ≥ (1/8) / (7/8) = 1/7,

where the last inequality uses E_{S∼D^m}[L_D(A(S))] ≥ 1/4.
