Understanding Machine Learning
Solution Manual

Written by Alon Gonen*

*The solutions to Chapters 13 and 14 were written by Shai Shalev-Shwartz.

Edited by Dana Rubinstein

November 17, 2014

2 Gentle Start

1. Given S = ((x_i, y_i))_{i=1}^m, define the multivariate polynomial

p_S(x) = −∏_{i∈[m] : y_i=1} ‖x − x_i‖².

Then, for every i s.t. y_i = 1 we have p_S(x_i) = 0, while for every other x we have p_S(x) < 0.

2. By the linearity of expectation,

E_{S|_x∼D^m}[L_S(h)] = E_{S|_x∼D^m}[(1/m) ∑_{i=1}^m 1_{[h(x_i)≠f(x_i)]}]
  = (1/m) ∑_{i=1}^m E_{x_i∼D}[1_{[h(x_i)≠f(x_i)]}]
  = (1/m) ∑_{i=1}^m P_{x_i∼D}[h(x_i) ≠ f(x_i)]
  = (1/m) · m · L_(D,f)(h)
  = L_(D,f)(h).
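As an illustration of the predictor from Exercise 1, here is a minimal sketch (ours, not part of the original solution) that builds p_S from a training set of NumPy-compatible vectors and labels x positively iff p_S(x) ≥ 0; the function names are illustrative.

    import numpy as np

    def make_polynomial_predictor(S):
        """Build p_S(x) = -prod_{i: y_i=1} ||x - x_i||^2 and the induced
        classifier h(x) = 1 iff p_S(x) >= 0."""
        positives = [np.asarray(x, dtype=float) for x, y in S if y == 1]

        def p_S(x):
            x = np.asarray(x, dtype=float)
            return -np.prod([np.sum((x - xi) ** 2) for xi in positives])

        def h(x):
            # p_S vanishes exactly on the positive training points and is negative elsewhere
            return 1 if p_S(x) >= 0 else 0

        return p_S, h

    # Toy usage
    S = [((0.0, 0.0), 1), ((1.0, 1.0), 1), ((3.0, 0.5), 0)]
    p_S, h = make_polynomial_predictor(S)
    print(h((0.0, 0.0)), h((1.0, 1.0)), h((3.0, 0.5)))  # 1 1 0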

3. (a) First, observe that by definition, A labels positively all the positive instances in the training set. Second, as we assume realizability, and since the tightest rectangle enclosing all positive examples is returned, all the negative instances are labeled correctly by A as well. We conclude that A is an ERM.

(b) Fix some distribution D over X, and define R* as in the hint. Let f be the hypothesis associated with R*. Given a training set S, denote by R(S) the rectangle returned by the proposed algorithm and by A(S) the corresponding hypothesis. The definition of the algorithm A implies that R(S) ⊆ R* for every S. Thus,

L_(D,f)(A(S)) = D(R* \ R(S)).

Fix some ε ∈ (0,1). Define R1, R2, R3 and R4 as in the hint. For each i ∈ [4], define the event

F_i = {S|_x : S|_x ∩ R_i = ∅}.

Applying the union bound, we obtain

D^m({S : L_(D,f)(A(S)) > ε}) ≤ D^m(∪_{i=1}^4 F_i) ≤ ∑_{i=1}^4 D^m(F_i).

Thus, it suffices to ensure that D^m(F_i) ≤ δ/4 for every i. Fix some i ∈ [4]. Then, the probability that a sample is in F_i is the probability that none of the instances falls in R_i, which is exactly (1 − ε/4)^m. Therefore,

D^m(F_i) = (1 − ε/4)^m ≤ exp(−εm/4),

and hence,

D^m({S : L_(D,f)(A(S)) > ε}) ≤ 4 exp(−εm/4).

Plugging in the assumption on m, we conclude our proof.

(c) The hypothesis class of axis-aligned rectangles in R^d is defined as follows. Given real numbers a_1 ≤ b_1, a_2 ≤ b_2, ..., a_d ≤ b_d, define the classifier h_(a_1,b_1,...,a_d,b_d) by

h_(a_1,b_1,...,a_d,b_d)(x_1,...,x_d) = 1 if a_i ≤ x_i ≤ b_i for every i ∈ [d], and 0 otherwise.    (1)

The class of all axis-aligned rectangles in R^d is defined as

H^d_rec = {h_(a_1,b_1,...,a_d,b_d) : ∀i ∈ [d], a_i ≤ b_i}.

It can be seen that the same algorithm proposed above is an ERM for this case as well. The sample complexity is analyzed similarly. The only difference is that instead of 4 strips, we have 2d strips (2 strips for each dimension). Thus, it suffices to draw a training set of size ⌈2d log(2d/δ)/ε⌉.

(d) For each dimension, the algorithm has to find the minimal and the maximal values among the positive instances in the training sequence. Therefore, its runtime is O(md). Since we have shown that the required value of m is at most ⌈2d log(2d/δ)/ε⌉, it follows that the runtime of the algorithm is indeed polynomial in d, 1/ε, and log(1/δ).
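The following sketch (ours, not from the text) implements the d-dimensional tightest-rectangle ERM of Exercise 3 with NumPy: per-dimension minima and maxima of the positive examples, in O(md) time, together with a helper that evaluates the sample-size bound ⌈2d log(2d/δ)/ε⌉ derived above.

    import math
    import numpy as np

    def tightest_rectangle(X, y):
        """ERM for axis-aligned rectangles in R^d: the per-dimension minima and
        maxima of the positive examples; runs in O(m*d) time."""
        pos = X[y == 1]
        if len(pos) == 0:
            return None  # no positive examples: predict all-negative
        return pos.min(axis=0), pos.max(axis=0)

    def predict(rect, x):
        if rect is None:
            return 0
        a, b = rect
        return int(np.all(a <= x) and np.all(x <= b))

    def sample_size(d, eps, delta):
        """The bound ceil(2d*log(2d/delta)/eps) derived above."""
        return math.ceil(2 * d * math.log(2 * d / delta) / eps)

    # Toy usage: learn a rectangle in R^2
    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 2))
    y = ((np.abs(X[:, 0]) <= 1) & (np.abs(X[:, 1]) <= 1)).astype(int)
    rect = tightest_rectangle(X, y)
    print(sample_size(d=2, eps=0.1, delta=0.05),
          predict(rect, np.array([0.5, 0.5])), predict(rect, np.array([1.5, 0.0])))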

3 A Formal Learning Model

1. The proofs follow (almost) immediately from the definition. We will show that the sample complexity is monotonically decreasing in the accuracy parameter ε. The proof that the sample complexity is monotonically decreasing in the confidence parameter δ is analogous.

Denote by D an unknown distribution over X, and let f ∈ H be the target hypothesis. Denote by A an algorithm which learns H with sample complexity m_H(·,·). Fix some δ ∈ (0,1). Suppose that 0 < ε_1 ≤ ε_2 ≤ 1. We need to show that m_1 := m_H(ε_1, δ) ≥ m_H(ε_2, δ) =: m_2. Given an i.i.d. training sequence of size m ≥ m_1, we have that with probability at least 1 − δ, A returns a hypothesis h such that

L_(D,f)(h) ≤ ε_1 ≤ ε_2.

By the minimality of m_2, we conclude that m_2 ≤ m_1.

2. (a) We propose the following algorithm. If a positive instance x+ appears in S, return the (true) hypothesis h_{x+}. If S doesn't contain any positive instance, the algorithm returns the all-negative hypothesis. It is clear that this algorithm is an ERM.

(b) Let ε ∈ (0,1), and fix the distribution D over X. If the true hypothesis is the all-negative hypothesis h−, then our algorithm returns a perfect hypothesis.

Assume now that there exists a unique positive instance x+. It is clear that if x+ appears in the training sequence S, our algorithm returns a perfect hypothesis. Furthermore, if D[{x+}] ≤ ε then in any case the returned hypothesis has a generalization error of at most ε (with probability 1). Thus, it is only left to bound the probability of the case in which D[{x+}] > ε, but x+ doesn't appear in S. Denote this event by F. Then

P_{S|_x∼D^m}[F] ≤ (1 − ε)^m ≤ e^{−εm}.

Hence, H_Singleton is PAC learnable, and its sample complexity is bounded by

m_H(ε,δ) ≤ ⌈log(1/δ)/ε⌉.
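A minimal sketch (ours) of the ERM for H_Singleton described in part (a): return h_{x+} if some positive instance appears in S, and the all-negative hypothesis otherwise.

    def learn_singleton(S):
        """ERM for H_Singleton: if some positive instance x+ appears in S,
        return h_{x+}; otherwise return the all-negative hypothesis."""
        positives = [x for x, y in S if y == 1]
        if positives:
            x_plus = positives[0]  # under realizability all positives are the same point
            return lambda x: 1 if x == x_plus else 0
        return lambda x: 0  # all-negative hypothesis

    # Toy usage: the unique positive instance is 3
    S = [(1, 0), (3, 1), (7, 0)]
    h = learn_singleton(S)
    print(h(3), h(7))  # 1 0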

3. Consider the ERM algorithm A which, given a training sequence S = ((x_i, y_i))_{i=1}^m, returns the hypothesis ĥ corresponding to the "tightest" circle which contains all the positive instances. Denote the radius of this hypothesis by r̂. Assume realizability and let h* be a circle with zero generalization error. Denote its radius by r*.

Let ε, δ ∈ (0,1). Let r̄ ≤ r* be a scalar s.t. D_X({x : r̄ ≤ ‖x‖ ≤ r*}) = ε. Define E = {x ∈ R² : r̄ ≤ ‖x‖ ≤ r*}. The probability (over drawing S) that L_D(h_S) ≥ ε is bounded above by the probability that no point in S belongs to E. The probability of this event is bounded above by

(1 − ε)^m ≤ e^{−εm}.

The desired bound on the sample complexity follows by requiring that e^{−εm} ≤ δ.
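A short sketch (ours) of this circle ERM, under the assumption, implicit in the use of a single radius above, that all circles are centered at the origin: the returned hypothesis labels x positively iff ‖x‖ ≤ r̂, where r̂ is the largest norm of a positive training example.

    import math

    def tightest_circle(S):
        """ERM for origin-centered circles: r_hat is the largest norm of a positive example."""
        radii = [math.hypot(*x) for x, y in S if y == 1]
        r_hat = max(radii) if radii else 0.0  # no positives: degenerate radius 0
        return lambda x: 1 if math.hypot(*x) <= r_hat else 0

    # Toy usage
    S = [((0.5, 0.5), 1), ((1.0, 0.0), 1), ((2.0, 2.0), 0)]
    h = tightest_circle(S)
    print(h((0.2, 0.3)), h((2.0, 0.1)))  # 1 0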

4. We first observe that H is finite. Let us calculate its size exactly. Each hypothesis, besides the all-negative hypothesis, is determined by deciding, for each variable x_i, whether x_i, x̄_i, or neither of them appears in the corresponding conjunction. Thus, |H| = 3^d + 1. We conclude that H is PAC learnable and its sample complexity can be bounded by

m_H(ε,δ) ≤ ⌈(d log 3 + log(1/δ))/ε⌉.

Let us describe our learning algorithm. We define h_0 = x_1 ∧ x̄_1 ∧ ··· ∧ x_d ∧ x̄_d. Observe that h_0 is the always-minus hypothesis. Let ((a^1, y_1), ..., (a^m, y_m)) be an i.i.d. training sequence of size m.


Since we cannot extract any information from negative examples, our algorithm ignores them. Given a positive example a = (a_1,...,a_d), we remove from the current hypothesis every literal that is falsified by a: if a_i = 1, we remove x̄_i, and if a_i = 0, we remove x_i. Denote by h_t the hypothesis obtained after processing the first t examples. Finally, our algorithm returns h_m.

By construction and realizability, h_t labels positively all the positive examples among a^1,...,a^t. For the same reasons, the set of literals in h_t contains the set of literals in the target hypothesis. Thus, h_t classifies correctly the negative examples among a^1,...,a^t. This implies that h_m is an ERM.

Since the algorithm takes linear time (in terms of the dimension d) to process each example, the running time is bounded by O(md).
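A sketch (ours) of the conjunction learner just described, with Boolean examples given as 0/1 tuples; literals are encoded as signed 1-based indices, +i for x_i and −i for x̄_i.

    def learn_conjunction(examples, d):
        """Start from h0 = x1 & ~x1 & ... & xd & ~xd and, for each positive
        example, drop every literal it falsifies. Runs in O(m*d) time."""
        # literal +i stands for x_i, literal -i stands for the negation of x_i (1-based)
        literals = set(range(1, d + 1)) | set(-i for i in range(1, d + 1))
        for a, y in examples:
            if y != 1:
                continue  # negative examples carry no information for this algorithm
            for i in range(1, d + 1):
                if a[i - 1] == 1:
                    literals.discard(-i)  # ~x_i is falsified by a
                else:
                    literals.discard(i)   # x_i is falsified by a

        def h(x):
            return int(all((x[i - 1] == 1) if i > 0 else (x[-i - 1] == 0) for i in literals))

        return h, literals

    # Toy usage: target conjunction is x1 & ~x3 over d = 3 variables
    train = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0)]
    h, lits = learn_conjunction(train, d=3)
    print(sorted(lits), h((1, 0, 0)), h((0, 1, 0)))  # [-3, 1] 1 0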

5. Fix some h ∈ H with L_(D̄_m,f)(h) > ε. By definition,

(P_{X∼D_1}[h(X) = f(X)] + ··· + P_{X∼D_m}[h(X) = f(X)]) / m < 1 − ε.

We now bound the probability that h is consistent with S (i.e., that L_S(h) = 0) as follows:

P_{S∼∏_{i=1}^m D_i}[L_S(h) = 0] = ∏_{i=1}^m P_{X∼D_i}[h(X) = f(X)]
  = ( (∏_{i=1}^m P_{X∼D_i}[h(X) = f(X)])^{1/m} )^m
  ≤ ( (∑_{i=1}^m P_{X∼D_i}[h(X) = f(X)]) / m )^m
  < (1 − ε)^m ≤ e^{−εm}.

The first inequality is the geometric-arithmetic mean inequality. Applying the union bound, we conclude that the probability that there exists some h ∈ H with L_(D̄_m,f)(h) > ε which is consistent with S is at most |H| exp(−εm).

6. Suppose that H is agnostic PAC learnable, and let A be a learning algorithm that learns H with sample complexity m_H(·,·). We show that H is PAC learnable using A.


Let D and f be an (unknown) distribution over X and the target function, respectively. We may assume w.l.o.g. that D is a joint distribution over X × {0,1}, where the conditional probability of y given x is determined deterministically by f. Since we assume realizability, we have inf_{h∈H} L_D(h) = 0. Let ε, δ ∈ (0,1). Then, for every positive integer m ≥ m_H(ε,δ), if we equip A with a training set S consisting of m i.i.d. instances which are labeled by f, then with probability at least 1 − δ (over the choice of S|_x), it returns a hypothesis h with

L_D(h) ≤ inf_{h'∈H} L_D(h') + ε = 0 + ε = ε.

7. Let x ∈ X, and let α_x denote the conditional probability of a positive label given x. We have

P[f_D(X) ≠ Y | X = x] = 1_{[α_x ≥ 1/2]} · P[Y = 0 | X = x] + 1_{[α_x < 1/2]} · P[Y = 1 | X = x]
  = 1_{[α_x ≥ 1/2]} · (1 − α_x) + 1_{[α_x < 1/2]} · α_x
  = min{α_x, 1 − α_x}.

Let g be a classifier¹ from X to {0,1}. We have

P[g(X) ≠ Y | X = x] = P[g(X) = 0 | X = x] · P[Y = 1 | X = x] + P[g(X) = 1 | X = x] · P[Y = 0 | X = x]
  = P[g(X) = 0 | X = x] · α_x + P[g(X) = 1 | X = x] · (1 − α_x)
  ≥ P[g(X) = 0 | X = x] · min{α_x, 1 − α_x} + P[g(X) = 1 | X = x] · min{α_x, 1 − α_x}
  = min{α_x, 1 − α_x}.

The statement now follows from the fact that the above holds for every x ∈ X. More formally, by the law of total expectation,

L_D(f_D) = E_{(x,y)∼D}[1_{[f_D(x)≠y]}]
  = E_{x∼D_X}[ E_{y∼D_{Y|x}}[1_{[f_D(x)≠y]} | X = x] ]
  = E_{x∼D_X}[ min{α_x, 1 − α_x} ]
  ≤ E_{x∼D_X}[ E_{y∼D_{Y|x}}[1_{[g(x)≠y]} | X = x] ]
  = L_D(g).

¹ g might be non-deterministic.
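A small numeric illustration (ours) of Exercise 7 on a finite domain with assumed known conditional probabilities α_x: it compares the error of the Bayes predictor f_D(x) = 1_{[α_x ≥ 1/2]} with that of an arbitrary competing classifier g.

    # alpha[x] = P[Y=1 | X=x], px[x] = marginal probability of x (assumed known for the illustration)
    alpha = {"a": 0.9, "b": 0.3, "c": 0.5}
    px = {"a": 0.5, "b": 0.3, "c": 0.2}

    def bayes(x):
        return 1 if alpha[x] >= 0.5 else 0

    def error(g):
        # L_D(g) = E_x[ P[g(x) != Y | X = x] ]
        return sum(px[x] * (1 - alpha[x] if g(x) == 1 else alpha[x]) for x in px)

    g = lambda x: 1  # an arbitrary competing classifier
    print(round(error(bayes), 3), round(error(g), 3))  # the Bayes error is the smaller one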

8. (a) This was proved in the previous exercise.

(b) We proved in the previous exercise that for every distribution D, the Bayes optimal predictor f_D is optimal w.r.t. D.

(c) Choose any distribution D. Then A is not better than f_D w.r.t. D.

9. (a) Suppose that H is PAC learnable in the one-oracle model. Let A be an algorithm which learns H and denote by m_H the function that determines its sample complexity. We prove that H is PAC learnable also in the two-oracle model.

Let D be a distribution over X × {0,1}. Note that drawing points from the negative and positive oracles with equal probability is equivalent to obtaining i.i.d. examples from a distribution D′ which gives equal probability to positive and negative examples. Formally, for every subset E ⊆ X we have

D′[E] = (1/2) D+[E] + (1/2) D−[E].

Thus, D′[{x : f(x) = 1}] = D′[{x : f(x) = 0}] = 1/2. If we give A access to a training set which is drawn i.i.d. according to D′ with size m_H(ε/2, δ), then with probability at least 1 − δ, A returns h with

ε/2 ≥ L_(D′,f)(h) = P_{x∼D′}[h(x) ≠ f(x)]
  = P_{x∼D′}[f(x) = 1, h(x) = 0] + P_{x∼D′}[f(x) = 0, h(x) = 1]
  = P_{x∼D′}[f(x) = 1] · P_{x∼D′}[h(x) = 0 | f(x) = 1] + P_{x∼D′}[f(x) = 0] · P_{x∼D′}[h(x) = 1 | f(x) = 0]
  = P_{x∼D′}[f(x) = 1] · P_{x∼D+}[h(x) = 0] + P_{x∼D′}[f(x) = 0] · P_{x∼D−}[h(x) = 1]
  = (1/2) · L_(D+,f)(h) + (1/2) · L_(D−,f)(h).

This implies that with probability at least 1 − δ, both

L_(D+,f)(h) ≤ ε and L_(D−,f)(h) ≤ ε.

Our definition of PAC learnability in the two-oracle model is therefore satisfied, and we can bound both m+_H(ε,δ) and m−_H(ε,δ) by m_H(ε/2, δ).

(b) Suppose that H is PAC learnable in the two-oracle model, and let A be an algorithm which learns H. We show that H is PAC learnable also in the standard model.

Let D be a distribution over X, and denote the target hypothesis by f. Let α = D[{x : f(x) = 1}]. Let ε, δ ∈ (0,1). According to our assumptions, there exist m+ := m+_H(ε, δ/2) and m− := m−_H(ε, δ/2) s.t. if we equip A with m+ examples drawn i.i.d. from D+ and m− examples drawn i.i.d. from D−, then, with probability at least 1 − δ/2, A will return h with

L_(D+,f)(h) ≤ ε ∧ L_(D−,f)(h) ≤ ε.

Our algorithm B draws m = max{2m+/ε, 2m−/ε, 8 log(4/δ)/ε} samples according to D. If there are fewer than m+ positive examples, B returns h−. Otherwise, if there are fewer than m− negative examples, B returns h+. Otherwise, B runs A on the sample and returns the hypothesis returned by A.

First we observe that if the sample contains m+ positive instances and m− negative instances, then the reduction to the two-oracle model works well. More precisely, with probability at least 1 − δ/2, A returns h with

L_(D+,f)(h) ≤ ε ∧ L_(D−,f)(h) ≤ ε.

Hence, with probability at least 1 − δ/2, the algorithm B returns (the same) h with

L_(D,f)(h) = α · L_(D+,f)(h) + (1 − α) · L_(D−,f)(h) ≤ ε.

We consider now the following cases:

• Assume that both α ≥ ε and 1 − α ≥ ε. We show that with probability at least 1 − δ/4, the sample contains at least m+ positive instances. For each i ∈ [m], define the indicator random variable Z_i, which gets the value 1 iff the i-th element in the sample is positive. Define Z = ∑_{i=1}^m Z_i to be the number of positive examples that were drawn. Clearly, E[Z] = αm. Using Chernoff's bound, we obtain

P[Z < (1 − 1/2)αm] < e^{−mα/8}.

By the way we chose m, we conclude that

P[Z < m+] < δ/4.

(Indeed, since α ≥ ε and m ≥ max{2m+/ε, 8 log(4/δ)/ε}, we have αm/2 ≥ εm/2 ≥ m+ and e^{−mα/8} ≤ e^{−mε/8} ≤ δ/4.) Similarly, if 1 − α ≥ ε, the probability that fewer than m− negative examples were drawn is at most δ/4. If both α ≥ ε and 1 − α ≥ ε, then, by the union bound, with probability at least 1 − δ/2, the training set contains at least m+ positive and m− negative instances, respectively. As we mentioned above, if this is the case, the reduction to the two-oracle model works with probability at least 1 − δ/2. The desired conclusion follows by applying the union bound.

• Assume that α < ε, and fewer than m+ positive examples are drawn. In this case, B returns the hypothesis h−. We obtain

L_D(h−) = α < ε.

Similarly, if (1 − α) < ε, and fewer than m− negative examples are drawn, B returns h+. In this case,

L_D(h+) = 1 − α < ε.

All in all, we have shown that with probability at least 1 − δ, B returns a hypothesis h with L_(D,f)(h) < ε. This satisfies our definition of PAC learnability in the one-oracle model.
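A sketch (ours) of the reduction B from part (b), where the two-oracle learner A is passed in as a callable taking the positive and negative sub-samples; the sampling interface and the toy learner are illustrative assumptions, not part of the solution.

    import math
    import random

    def algorithm_B(sample_from_D, A, m_plus, m_minus, eps, delta):
        """Reduction from the two-oracle PAC model to the standard model.
        sample_from_D() returns one labeled pair (x, y) drawn from D."""
        m = max(math.ceil(2 * m_plus / eps),
                math.ceil(2 * m_minus / eps),
                math.ceil(8 * math.log(4 / delta) / eps))
        S = [sample_from_D() for _ in range(m)]
        positives = [x for x, y in S if y == 1]
        negatives = [x for x, y in S if y == 0]
        if len(positives) < m_plus:
            return lambda x: 0  # h-: the all-negative hypothesis
        if len(negatives) < m_minus:
            return lambda x: 1  # h+: the all-positive hypothesis
        # enough examples of both kinds: feed A as if they came from D+ and D-
        return A(positives[:m_plus], negatives[:m_minus])

    # Toy usage: D is uniform on {0,...,9} and f(x) = 1 iff x >= 7
    random.seed(0)

    def sample_from_D():
        x = random.randrange(10)
        return x, int(x >= 7)

    def toy_two_oracle_learner(pos, neg):
        t = min(pos)  # crude threshold learner standing in for A
        return lambda x: int(x >= t)

    h = algorithm_B(sample_from_D, toy_two_oracle_learner, m_plus=5, m_minus=5, eps=0.1, delta=0.1)
    print(h(9), h(0))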

4 Learning via Uniform Convergence

1. (a) Assume that for every ε, δ ∈ (0,1) and every distribution D over X × {0,1}, there exists m(ε,δ) ∈ N such that for every m ≥ m(ε,δ),

P_{S∼D^m}[L_D(A(S)) > ε] < δ.

Let λ > 0. We need to show that there exists m_0 ∈ N such that for every m ≥ m_0, E_{S∼D^m}[L_D(A(S))] ≤ λ. Let ε = min(1/2, λ/2) and set m_0 = m(ε, ε). For every m ≥ m_0, since the loss is bounded above by 1, we have

E_{S∼D^m}[L_D(A(S))] ≤ P_{S∼D^m}[L_D(A(S)) > λ/2] · 1 + P_{S∼D^m}[L_D(A(S)) ≤ λ/2] · (λ/2)
  ≤ P_{S∼D^m}[L_D(A(S)) > ε] + λ/2
  ≤ ε + λ/2
  ≤ λ/2 + λ/2 = λ.

(b) Assume now that

lim_{m→∞} E_{S∼D^m}[L_D(A(S))] = 0.

Let ε, δ ∈ (0,1). There exists some m_0 ∈ N such that for every m ≥ m_0, E_{S∼D^m}[L_D(A(S))] ≤ ε · δ. By Markov's inequality,

P_{S∼D^m}[L_D(A(S)) > ε] ≤ E_{S∼D^m}[L_D(A(S))] / ε ≤ εδ/ε = δ.

2. The left inequality follows from Corollary 4.4. We prove the right inequality. Fix some h ∈ H. Applying Hoeffding's inequality, we obtain

P_{S∼D^m}[|L_D(h) − L_S(h)| ≥ ε/2] ≤ 2 exp(−2m(ε/2)² / (b − a)²).    (2)

The desired inequality is obtained by requiring that the right-hand side of Equation (2) is at most δ/|H|, and then applying the union bound.
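With the additional assumption that the loss takes values in an interval of length b − a = 1, requiring the right-hand side of Equation (2) to be at most δ/|H| yields m ≥ 2 log(2|H|/δ)/ε²; the helper below (ours) just evaluates this bound.

    import math

    def agnostic_sample_complexity(H_size, eps, delta, loss_range=1.0):
        """Smallest m with 2*exp(-2*m*(eps/2)**2 / loss_range**2) <= delta / H_size,
        i.e. m >= 2 * loss_range**2 * log(2*H_size/delta) / eps**2."""
        return math.ceil(2 * loss_range ** 2 * math.log(2 * H_size / delta) / eps ** 2)

    print(agnostic_sample_complexity(H_size=1000, eps=0.1, delta=0.05))  # 2120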

5 The Bias-Complexity Tradeoff

1. We simply follow the hint. By Lemma B.1,

P_{S∼D^m}[L_D(A(S)) ≥ 1/8] = P_{S∼D^m}[L_D(A(S)) ≥ 1 − 7/8]
  ≥ (E_{S∼D^m}[L_D(A(S))] − (1 − 7/8)) / (7/8)
  ≥ (1/8) / (7/8) = 1/7,

where the last inequality uses E_{S∼D^m}[L_D(A(S))] ≥ 1/4.
