

Essential Guide to Pandas

Efficient data analysis with Python.

Learn how to manipulate, analyze, and visualize data efficiently with pandas. A complete practical guide to data analysis with Python.


Ibon Martínez-Arranz


Introduction

Welcome to our in-depth manual on Pandas, a cornerstone Python library that is indispensable in the realms of data science and analysis. Pandas provides a rich set of tools and functions that make data analysis, manipulation, and visualization both accessible and powerful.

Pandas, short for “Panel Data”, is an open-source library that offers high-level data structures and a vast array of tools for practical data analysis in Python. It has become synonymous with data wrangling, offering the DataFrame as its central data structure, which is effectively a table or a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

To begin using Pandas, it’s typically imported alongside NumPy, another key library for numerical computations. The conventional way to import Pandas is as follows:

import pandas as pd
import numpy as np

In this manual, we will explore the multifaceted features of Pandas, covering a wide range of functionalities that cater to the needs of data analysts and scientists. Our guide will walk you through the following key areas:

1. Data Loading: Learn how to efficiently import data into Pandas from different sources such as CSV files, Excel sheets, and databases.

2. Basic Data Inspection: Understand the structure and content of your data through simple yet powerful inspection techniques.

3. Data Cleaning: Learn to identify and rectify inconsistencies, missing values, and anomalies in your dataset, ensuring data quality and reliability.

4. Data Transformation: Discover methods to reshape, aggregate, and modify data to suit your analytical needs.

5. Data Visualization: Integrate Pandas with visualization tools to create insightful and compelling graphical representations of your data.

6. Statistical Analysis: Utilize Pandas for descriptive and inferential statistics, making data-driven decisions easier and more accurate.

7. Indexing and Selection: Master the art of accessing and selecting data subsets efficiently for analysis.

8. Data Formatting and Conversion: Adapt your data into the desired format, enhancing its usability and compatibility with different analysis tools.

9. Advanced Data Transformation: Delve deeper into sophisticated data transformation techniques for complex data manipulation tasks.

10. Handling Time Series Data: Explore the handling of time-stamped data, crucial for time series analysis and forecasting.

11. File Import/Export: Learn how to effortlessly read from and write to various file formats, making data interchange seamless.

12. Advanced Queries: Employ advanced querying techniques to extract specific insights from large datasets.

13. Multi-Index Operations: Understand multi-level indexing to work with high-dimensional data more effectively.

14. Data Merging Techniques: Explore various strategies to combine datasets, enhancing your analytical possibilities.

15. Dealing with Duplicates: Detect and handle duplicate records to maintain the integrity of your analysis.

16. Custom Operations with Apply: Harness the power of custom functions to extend Pandas’ capabilities.

17. Integration with Matplotlib for Custom Plots: Create bespoke plots by integrating Pandas with Matplotlib, a leading plotting library.

18. Advanced Grouping and Aggregation: Perform complex grouping and aggregation operations for sophisticated data summaries.

19. Text Data Specific Operations: Manipulate and analyze textual data effectively using Pandas’ string functions.

20. Working with JSON and XML: Handle modern data formats like JSON and XML with ease.

21. Advanced File Handling: Learn advanced techniques for managing file I/O operations.

22. Dealing with Missing Data: Develop strategies to address and impute missing values in your datasets.

23. Data Reshaping: Transform the structure of your data to facilitate different types of analysis.

24. Categorical Data Operations: Efficiently manage and analyze categorical data.

25. Advanced Indexing: Leverage advanced indexing techniques for more powerful data manipulation.

26. Efficient Computations: Optimize performance for large-scale data operations.

27. Advanced Data Merging: Explore sophisticated data merging and joining techniques for complex datasets.

28. Data Quality Checks: Implement strategies to ensure and maintain the quality of your data throughout the analysis process.

29. Real-World Case Studies: Apply the concepts and techniques learned throughout the manual to real-world scenarios using the Titanic dataset. This chapter demonstrates practical data analysis workflows, including data cleaning, exploratory analysis, and survival analysis, providing insights into how to utilize Pandas in practical applications to derive meaningful conclusions from complex datasets.

This manual is designed to empower you with the knowledge and skills to effectively manipulate and analyze data using Pandas, turning raw data into valuable insights. Let’s begin our journey into the world of data analysis with Pandas.

Pandas, being a cornerstone in the Python data analysis landscape, has a wealth of resources and references available for those looking to delve deeper into its capabilities. Below are some key references and resources where you can find additional information, documentation, and support for working with Pandas:

1. Official Pandas Website and Documentation:

• The official website for Pandas is pandas.pydata.org. Here, you can find comprehensive documentation, including a detailed user guide, API reference, and numerous tutorials. The documentation is an invaluable resource for both beginners and experienced users, offering detailed explanations of Pandas’ functionalities along with examples.

2. Pandas GitHub Repository:

• The Pandas GitHub repository, github.com/pandas-dev/pandas, is the primary source of the latest source code. It’s also a hub for the development community where you can report issues, contribute to the codebase, and review upcoming features.

3. Pandas Community and Support:

• Stack Overflow: A large number of questions and answers can be found under the ‘pandas’ tag on Stack Overflow. It’s a great place to seek help and contribute to community discussions.

• Mailing List: Pandas has an active mailing list for discussion and asking questions about usage and development.

• Social Media: Follow Pandas on platforms like Twitter for updates, tips, and community interactions.

4. Scientific Python Ecosystem:

• Pandas is a part of the larger ecosystem of scientific computing in Python, which includes libraries like NumPy, SciPy, Matplotlib, and IPython. Understanding these libraries in conjunction with Pandas can be highly beneficial.

5. Books and Online Courses:

• There are numerous books and online courses available that cover Pandas, often within the broader context of Python data analysis and data science. These can be excellent resources for structured learning and in-depth understanding.

6. Community Conferences and Meetups:

• Python and data science conferences often feature talks and workshops on Pandas. Local Python meetups can also be a good place to learn from and network with other users.

7. Jupyter Notebooks:

• Many online repositories and platforms host Jupyter Notebooks showcasing Pandas use cases. These interactive notebooks are excellent for learning by example and experimenting with code.

By exploring these resources, you can deepen your understanding of Pandas, stay updated with the latest developments, and connect with a vibrant community of users and contributors.

Data Loading

Efficient data loading is fundamental to any data analysis process. Pandas offers several functions to read data from different formats, making it easier to manipulate and analyze the data. In this chapter, we will explore how to read data from CSV files, Excel files, and SQL databases using Pandas.

Read CSV File

The read_csv function is used to load data from CSV files into a DataFrame. This function is highly customizable, with numerous parameters to handle different formats and data types. Here is a basic example:

import pandas as pd

# Load data from a CSV file into a DataFrame
df = pd.read_csv('filename.csv')

This command reads data from ‘filename.csv’ and stores it in the DataFrame df. The file path can be a URL or a local file path.

Read Excel File

To read data from an Excel file, use the read_excel function. This function supports reading from both xls and xlsx file formats and allows you to specify the sheet to be loaded.

# Load data from an Excel file into a DataFrame
df = pd.read_excel('filename.xlsx')

This reads the first sheet in the Excel workbook ‘filename.xlsx’ by default. You can specify a different sheet by using the sheet_name parameter.

Read from SQL Database

Pandas can also load data directly from a SQL database using the read_sql function. This function requires a SQL query and a connection object to the database.

import sqlalchemy

# Create a connection to a SQL database
engine = sqlalchemy.create_engine('sqlite:///example.db')
query = "SELECT * FROM my_table"

# Load data from a SQL database into a DataFrame
df = pd.read_sql(query, engine)

This example demonstrates how to connect to a SQLite database and read data from ‘my_table’ into a DataFrame.

Basic Data Inspection

Display Top Rows (df.head())

This command, df.head(), displays the first five rows of the DataFrame, providing a quick glimpse of the data, including column names and some of the values.

    A         B    C          D         E
0  81  0.692744  Yes 2023-01-01 -1.082325
1  54  0.316586  Yes 2023-01-02  0.031455
2  57  0.860911  Yes 2023-01-03 -2.599667
3   6  0.182256   No 2023-01-04 -0.603517
4  82  0.210502   No 2023-01-05 -0.484947

Display Bottom Rows (df.tail())

This command, df.tail(), shows the last five rows of the DataFrame, useful for checking the end of your dataset.

    A         B    C          D         E
5  73  0.463415   No 2023-01-06 -0.442890
6  13  0.513276   No 2023-01-07 -0.289926
7  23  0.528147  Yes 2023-01-08  1.521620
8  87  0.138674  Yes 2023-01-09 -0.026802
9  39  0.005347   No 2023-01-10 -0.159331

Display Data Types (df.dtypes)

This command, df.dtypes, returns the data types of each column in the DataFrame. It’s helpful to understand the kind of data (integers, floats, strings, etc.) each column holds.
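For the sample DataFrame above, a minimal sketch of the call and its output (the dtypes are taken from the info() summary shown later in this chapter):

print(df.dtypes)

A             int64
B           float64
C            object
D    datetime64[ns]
E           float64
dtype: object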

Summary Statistics (df.describe())

This command, df.describe(), provides descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. It’s useful for a quick statistical overview.

               A          B          E
count  10.000000  10.000000  10.000000
mean   51.500000   0.391186  -0.413633
std    29.963867   0.267698   1.024197
min     6.000000   0.005347  -2.599667
25%    27.000000   0.189317  -0.573874
50%    55.500000   0.390001  -0.366408
75%    79.000000   0.524429  -0.059934
max    87.000000   0.860911   1.521620

Display Index, Columns, and Data (df.info())

This command, df.info(), provides a concise summary of the DataFrame, including the number of non-null values in each column and the memory usage. It’s essential for initial data assessment.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
 0   A       10 non-null     int64
 1   B       10 non-null     float64
 2   C       10 non-null     object
 3   D       10 non-null     datetime64[ns]
 4   E       10 non-null     float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 528.0 bytes

Data Cleaning

Let’s go through the data cleaning process in a more detailed manner, step by step. We will start by creating a DataFrame that includes missing (NA or null) values, then apply various data cleaning operations, showing both the commands used and the resulting outputs.

First, we create a sample DataFrame that includes some missing values:

import pandas as pd

# Sample DataFrame with missing values
data = {
    'old_name': [1, 2, None, 4, 5],
    'B': [10, None, 12, None, 14],
    'C': ['A', 'B', 'C', 'D', 'E'],
    'D': pd.date_range(start='2023-01-01', periods=5, freq='D'),
    'E': [20, 21, 22, 23, 24]
}
df = pd.DataFrame(data)
df

   old_name     B  C          D   E
0       1.0  10.0  A 2023-01-01  20
1       2.0   NaN  B 2023-01-02  21
2       NaN  12.0  C 2023-01-03  22
3       4.0   NaN  D 2023-01-04  23
4       5.0  14.0  E 2023-01-05  24

This DataFrame contains missing values in columns ‘old_name’ and ‘B’.

Checking for Missing Values

To find out where the missing values are located, we use:

missing_values = df.isnull().sum()

Result:

old_name    1
B           2
C           0
D           0
E           0
dtype: int64

Filling Missing Values

We can fill missing values with a specific value or a computed value (like the mean of the column):

filled_df = df.fillna({'old_name': 0, 'B': df['B'].mean()})

Result:

   old_name     B  C          D   E
0       1.0  10.0  A 2023-01-01  20
1       2.0  12.0  B 2023-01-02  21
2       0.0  12.0  C 2023-01-03  22
3       4.0  12.0  D 2023-01-04  23
4       5.0  14.0  E 2023-01-05  24

Dropping Missing Values

Alternatively, we can drop rows with missing values:

dropped_df = df.dropna(axis='index')

Result:

   old_name     B  C          D   E
0       1.0  10.0  A 2023-01-01  20
4       5.0  14.0  E 2023-01-05  24

We can also drop columns with missing values:

dropped_df = df.dropna(axis='columns')

Result:

   C          D   E
0  A 2023-01-01  20
1  B 2023-01-02  21
2  C 2023-01-03  22
3  D 2023-01-04  23
4  E 2023-01-05  24

Renaming Columns

To rename columns for clarity or standardization:

renamed_df = df.rename(columns={'old_name': 'A'})

Result:

     A     B  C          D   E
0  1.0  10.0  A 2023-01-01  20
1  2.0   NaN  B 2023-01-02  21
2  NaN  12.0  C 2023-01-03  22
3  4.0   NaN  D 2023-01-04  23
4  5.0  14.0  E 2023-01-05  24

Dropping Columns

To remove unnecessary columns:

dropped_columns_df = df.drop(columns=['E'])

Result:

   old_name     B  C          D
0       1.0  10.0  A 2023-01-01
1       2.0   NaN  B 2023-01-02
2       NaN  12.0  C 2023-01-03
3       4.0   NaN  D 2023-01-04
4       5.0  14.0  E 2023-01-05

Each of these steps demonstrates a fundamental aspect of data cleaning in Pandas, crucial for preparing your dataset for further analysis.

Data Transformation

Data transformation is a crucial step in preparing your dataset for analysis. Pandas provides powerful tools to transform, summarize, and combine data efficiently. This chapter covers key techniques such as applying functions, grouping and aggregating data, creating pivot tables, and merging or concatenating DataFrames.

Apply Function

The apply function allows you to apply a custom function to the DataFrame elements. This method is extremely flexible and can be applied to a single column or the entire DataFrame. Here’s an example using apply on a single column to calculate the square of each value:

df['squared'] = df['number'].apply(lambda x: x**2)

GroupBy and Aggregate

Grouping and aggregating data are essential for summarizing data. Here’s how you can group by one column and aggregate another column using sum:

# Group by the 'group' column and sum the 'value' column
grouped_df = df.groupby('group').agg({'value': 'sum'})

The following Python script creates a DataFrame with data categorized by groups and two value columns. It then groups the data by the group column and applies different statistical aggregation functions to value1 and value2. For value1, it calculates the mean and standard deviation. For value2, it computes the median and a custom measure, which is a string combining the mean and standard deviation.

import pandas as pd

# Custom measure: the mean and standard deviation of a series formatted as a string
def custom_measure(x):
    return f"{x.mean():.2f} +/- {x.std():.2f}"

# Create a new DataFrame with a 'group' column and two value columns (example values)
data = {'group': ['A', 'A', 'B', 'B', 'C', 'C'],
        'value1': [10, 20, 30, 40, 50, 60],
        'value2': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)

# Group by the 'group' column and apply different aggregation functions to each column
grouped_df = df.groupby('group').agg({
    'value1': [('Mean', 'mean'), ('Standard Deviation', 'std')],   # mean and renamed standard deviation for value1
    'value2': [('Median', 'median'), ('Measure', custom_measure)]  # median and custom measure for value2
})
print(grouped_df)

Pivot Table

Pivot tables are used to summarize and reorganize data in a DataFrame. Here’s an example of creating a pivot table to find the mean values:

# Sample DataFrame
data = {'category': ['A', 'A', 'B', 'B', 'A'],
        'value': [100, 200, 300, 400, 150]}
df = pd.DataFrame(data)

# Creating a pivot table
pivot_table = df.pivot_table(index='category', values='value', aggfunc='mean')

Merge DataFrames

Merging DataFrames is akin to performing SQL joins. Here’s an example of merging two DataFrames on a common column:

# Sample DataFrames
data1 = {'id': [1, 2, 3],
         'name': ['Alice', 'Bob', 'Charlie']}
df1 = pd.DataFrame(data1)
data2 = {'id': [1, 2, 4],
         'age': [25, 30, 35]}
df2 = pd.DataFrame(data2)

# Merging df1 and df2 on the 'id' column
merged_df = pd.merge(df1, df2, on='id')

Result:

   id   name  age
0   1  Alice   25
1   2    Bob   30

Concatenate DataFrames

Concatenating DataFrames is useful when you need to combine similar data from different sources. Here’s how to concatenate two DataFrames:

# Sample DataFrames
data3 = {'name': ['David', 'Ella'],
         'age': [28, 22]}
df3 = pd.DataFrame(data3)

# Concatenating df2 and df3
concatenated_df = pd.concat([df2, df3])

Result:

    id  age   name
0  1.0   25    NaN
1  2.0   30    NaN
2  4.0   35    NaN
0  NaN   28  David
1  NaN   22   Ella

These techniques provide a robust framework for transforming data, allowing you to prepare and analyze your datasets more effectively.

Data Visualization Integration

Visualizing data is a powerful way to understand and communicate the underlying patterns and relationships within your dataset. Pandas integrates seamlessly with Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python. This chapter demonstrates how to use Pandas for common data visualizations.

Histogram

Histograms are used to plot the distribution of a dataset. Here’s how to create a histogram from a DataFrame column:
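A minimal sketch of such a histogram, using made-up sample values:

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame (example values)
data = {'scores': [88, 92, 80, 89, 90, 78, 84, 76, 95, 92]}
df = pd.DataFrame(data)

# Creating a histogram of the 'scores' column
df['scores'].plot.hist(bins=5, alpha=0.7)
plt.title('Score Distribution')
plt.xlabel('Scores')
plt.show()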

Figure 1: Image generated by the provided code.

Boxplot

Boxplots are useful for visualizing the distribution of data through their quartiles and detecting outliers. Here’s how to create boxplots for multiple columns:

# Sample DataFrame
data = {'math_scores': [88, 76, 90, 84, 65],
        'eng_scores': [78, 82, 88, 91, 73]}
df = pd.DataFrame(data)

# Creating a boxplot
df.boxplot(column=['math_scores', 'eng_scores'])
plt.title('Score Distribution')
plt.ylabel('Scores')
plt.show()

Figure 2: Image generated by the provided code.

Scatter Plot

Scatter plots are ideal for examining the relationship between two numeric variables. Here’s how to create a scatter plot:

# Sample DataFrame
data = {'hours_studied': [10, 15, 8, 12, 6],
        'test_score': [95, 80, 88, 90, 70]}
df = pd.DataFrame(data)

# Creating a scatter plot
df.plot.scatter(x='hours_studied', y='test_score', c='DarkBlue')
plt.title('Test Score vs Hours Studied')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.show()

Figure 3: Image generated by the provided code.

Line Plot

Line plots are used to visualize data points connected by straight line segments. This is particularly useful in time series analysis:

# Sample DataFrame
data = {'year': [2010, 2011, 2012, 2013, 2014],
        'sales': [200, 220, 250, 270, 300]}
df = pd.DataFrame(data)

# Creating a line plot
df.plot.line(x='year', y='sales', color='red')
plt.title('Yearly Sales')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Figure 4: Image generated by the provided code.

Bar Chart

Bar charts are used to compare different groups. Here’s an example of a bar chart visualizing the count of values in a column:

# Sample DataFrame
data = {'product': ['Apples', 'Oranges', 'Bananas', 'Apples', 'Oranges', 'Apples']}
df = pd.DataFrame(data)

# Creating a bar chart
df['product'].value_counts().plot.bar(color='green')
plt.title('Product Frequency')
plt.xlabel('Product')
plt.ylabel('Frequency')
plt.show()

Figure 5: Image generated by the provided code.

Each of these visualization techniques provides insights into different aspects of your data, making it easier to perform comprehensive data analysis and interpretation.

Statistical Analysis

Statistical analysis is a key component of data analysis, helping to understand trends, relationships, and distributions in data. Pandas offers a range of functions for performing statistical analyses, which can be incredibly insightful when exploring your data. This chapter will cover the basics, including correlation, covariance, and various ways of summarizing data distributions.

Correlation Matrix

A correlation matrix displays the correlation coefficients between variables. Each cell in the table shows the correlation between two variables. Here’s how to generate a correlation matrix:
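A minimal sketch, using a small numeric DataFrame with made-up values (the same df is reused for the covariance example below):

import pandas as pd

# Sample numeric DataFrame (example values)
data = {'sales': [100, 150, 200, 250, 300],
        'ads': [10, 20, 25, 30, 40],
        'price': [9.5, 9.0, 8.5, 8.0, 7.5]}
df = pd.DataFrame(data)

# Creating a correlation matrix
corr_matrix = df.corr()
print(corr_matrix)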

Covariance Matrix

The covariance matrix is similar to a correlation matrix but shows the covariance between variables. Here’s how to generate a covariance matrix:

# Creating a covariance matrix
cov_matrix = df.cov()
print(cov_matrix)

Value Counts

This function is used to count the number of unique entries in a column, which can be particularly useful for categorical data:
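A minimal sketch, assuming a categorical 'department' column with made-up values (the same column is used in the examples that follow):

import pandas as pd

# Sample categorical data (example values)
df = pd.DataFrame({'department': ['HR', 'IT', 'HR', 'Finance', 'IT', 'IT']})

# Counting occurrences of each unique value
print(df['department'].value_counts())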

Unique Values in Column

To find unique values in a column, use the unique function. This can help identify the diversity of entries in a column:

# Getting unique values from the column
unique_values = df['department'].unique()
print(unique_values)

Number of Unique Values

If you need to know how many unique values are in a column, use nunique:

# Counting unique values
num_unique_values = df['department'].nunique()
print(num_unique_values)

These tools provide a fundamental insight into the statistical characteristics of your data, essential for both preliminary data exploration and advanced analyses.

Indexing and Selection

Effective data manipulation in Pandas often involves precise indexing and selection to isolate specific data segments. This chapter demonstrates several methods to select columns and rows in a DataFrame, enabling refined data analysis.

Select Column
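A minimal sketch of selecting a single column, using a small name/age DataFrame whose values match the results shown later in this chapter:

import pandas as pd

# Sample DataFrame (values consistent with the outputs below)
data = {'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35]}
df = pd.DataFrame(data)

# Selecting a single column returns a Series
ages = df['age']
print(ages)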

Select Multiple Columns

To select multiple columns, use a list of column names. The result is a new DataFrame:
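Continuing the sketch above:

# Selecting multiple columns returns a new DataFrame
subset = df[['name', 'age']]
print(subset)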

You can select rows based on their position using iloc, which is primarily integer-position based:
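Continuing the sketch above:

# Selecting the first two rows by integer position
selected_rows = df.iloc[0:2]
print(selected_rows)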

To select rows by label index, use loc, which uses labels in the index:

# Selecting rows by label
selected_rows_by_label = df.loc[0:1]
print(selected_rows_by_label)

    name  age
0  Alice   25
1    Bob   30

Conditional Selection

For conditional selection, use a condition within brackets to filter data based on column values:

# Conditional selection
condition_selected = df[df['age'] > 30]
print(condition_selected)

Result:

      name  age
2  Charlie   35

This selection and indexing functionality in Pandas allows for flexible and efficient data manipulations, forming the basis of many data operations you’ll perform.

Data Formatting and Conversion

Data often needs to be formatted or converted to different types to meet the requirements of various analysis tasks. Pandas provides versatile capabilities for data formatting and type conversion, allowing for effective manipulation and preparation of data. This chapter covers some essential operations for data formatting and conversion.

Convert Data Types

Changing the data type of a column in a DataFrame is often necessary during data cleaning and preparation. Use astype to convert the data type of a column:

# Sample DataFrame
data = {'age': ['25', '30', '35']}
df = pd.DataFrame(data)

# Converting the 'age' column to integer
df['age'] = df['age'].astype(int)
print(df['age'].dtypes)

String Operations

Pandas can perform vectorized string operations on Series using .str. This is useful for cleaning and transforming text data:

# Sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie']}
df = pd.DataFrame(data)

# Converting all names to lowercase
df['name'] = df['name'].str.lower()
print(df)

Datetime Conversion

Converting strings or other datetime formats into a standardized datetime64 type is essential for time series analysis. Use pd.to_datetime to convert a column:

# Sample DataFrame
data = {'date': ['2023-01-01', '2023-01-02', '2023-01-03']}
df = pd.DataFrame(data)

# Converting 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtypes)

Result:

datetime64[ns]

Setting a specific column as the index of a DataFrame can facilitate faster searches, better alignment, and easier access to rows:
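A minimal sketch, continuing with the 'date' DataFrame from the previous example:

# Setting the 'date' column as the index
df = df.set_index('date')
print(df.index)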

These formatting and conversion techniques are crucial for preparing your dataset for detailed analysis and ensuring compatibility across different analysis and visualization tools.

Advanced Data Transformation

Advanced data transformation involves sophisticated techniques that help in reshaping, restructuring, and summarizing complex datasets. This chapter delves into some of the more advanced functions available in Pandas that enable detailed manipulation and transformation of data.

Lambda Functions

Lambda functions provide a quick and efficient way of applying an operation across a DataFrame. Here’s how you can use apply with a lambda function to increment every element in the DataFrame:
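A minimal sketch with a small numeric DataFrame (example values):

import pandas as pd

# Sample DataFrame (example values)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Incrementing every element; apply passes each column (a Series) to the lambda
incremented = df.apply(lambda x: x + 1)
print(incremented)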

Pivot Longer/Wider Format

The melt function is used to transform data from wide format to long format, which can be more suitable for analysis:
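A minimal sketch, using hypothetical subject scores:

# Sample wide-format DataFrame (example values)
wide_df = pd.DataFrame({'id': [1, 2],
                        'math': [90, 80],
                        'english': [85, 95]})

# Reshaping from wide to long format
long_df = pd.melt(wide_df, id_vars='id', var_name='subject', value_name='score')
print(long_df)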

Stack/Unstack

The stack method pivots a DataFrame’s columns into rows, producing a Series with a MultiIndex, while unstack performs the inverse operation, pivoting an index level back into columns:
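A minimal sketch (example values):

# Sample DataFrame (example values)
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# Stacking columns into a MultiIndex Series
stacked = df.stack()
print(stacked)

# Unstacking back to the original wide layout
unstacked = stacked.unstack()
print(unstacked)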

Cross Tabulations

Cross tabulations are used to compute a simple cross-tabulation of two (or more) factors. This can be very useful in statistics and probability analysis:

# Cross-tabulation example
data = {'Gender': ['Female', 'Male', 'Female', 'Male'],
        'Handedness': ['Right', 'Left', 'Right', 'Right']}
df = pd.DataFrame(data)

# Creating a cross tabulation
crosstab = pd.crosstab(df['Gender'], df['Handedness'])

These advanced transformations enable sophisticated handling of data structures, enhancing the ability to analyze complex datasets effectively.

Handling Time Series Data

Time series data analysis is a crucial aspect of many fields such as finance, economics, and meteorology. Pandas provides robust tools for working with time series data, allowing for detailed analysis of time-stamped information. This chapter will explore how to manipulate time series data effectively using Pandas.

Set Datetime Index

Setting a datetime index is foundational in time series analysis as it facilitates easier slicing, aggregation, and resampling of data:

import pandas as pd

# Sample DataFrame with date information
data = {'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
        'value': [100, 110, 120, 130]}
df = pd.DataFrame(data)

# Converting 'date' column to datetime and setting it as index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df)

Resampling Data

Resampling is a powerful method for time series data aggregation or downsampling, which changes the frequency of your data:

# Resampling the data monthly and calculating the mean
monthly_mean = df.resample('M').mean()

Rolling Window Operations

Rolling window operations are useful for smoothing or calculating moving averages, which can help in identifying trends in time series data:

# Adding more data points for a better rolling example
additional_data = {'date': pd.date_range('2023-01-05', periods=5, freq='D'),
                   'value': [140, 150, 160, 170, 180]}
additional_df = pd.DataFrame(additional_data)
df = pd.concat([df, additional_df.set_index('date')])

# Calculating rolling mean with a window of 5 days
df['rolling_mean'] = df['value'].rolling(window=5).mean()
print(df)

These techniques are essential for analyzing time series data efficiently, providing the tools needed to handle trends, seasonality, and other temporal structures in data.

File Export

Once data analysis is complete, it is often necessary to export data into various formats for reporting, further analysis, or sharing. Pandas provides versatile tools to export data to different file formats, including CSV, Excel, and SQL databases. This chapter will cover how to export DataFrames to these common formats.

Write to CSV

Exporting a DataFrame to a CSV file is straightforward and one of the most common methods for data sharing:
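A minimal sketch of the call described below, assuming a DataFrame df:

# Writing the DataFrame to a CSV file without the index column
df.to_csv('filename.csv', index=False)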

This function will create a CSV file named filename.csv in the current directory without the index column.

Write to Excel

Exporting data to an Excel file can be done using the to_excel method, which allows for the storage of data along with formatting that can be useful for reports:

# Writing the DataFrame to an Excel file
df.to_excel('filename.xlsx', index=False)  # index=False to avoid writing row indices

This will create an Excel file filename.xlsx in the current directory.

Write to SQL Database

Pandas can also export a DataFrame directly to a SQL database, which is useful for integrating analysis results into applications or storing data in a centralized database:

import sqlalchemy

# Creating a SQL connection engine
engine = sqlalchemy.create_engine('sqlite:///example.db')  # Example using SQLite

# Writing the DataFrame to a SQL database
df.to_sql('table_name',
          con=engine,
          index=False,
          if_exists='replace')

The to_sql function will create a new table named table_name in the specified SQL database and write the DataFrame to this table. The if_exists='replace' parameter will replace the table if it already exists; use if_exists='append' to add data to an existing table instead.

These export functionalities enhance the versatility of Pandas, allowing for seamless transitions between different stages of data processing and sharing.

Advanced Data Queries

Performing advanced queries on a DataFrame allows for precise data filtering and extraction, which is essential for detailed analysis. This chapter explores the use of the query function and the isin method for sophisticated data querying in Pandas.

Query Function

The query function allows you to filter rows based on a query expression. It’s a powerful way to select data dynamically:
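A minimal sketch; only the last row of the original output (index 4, Eve, age 45) is legible here, so the remaining names and ages are made-up example values:

import pandas as pd

# Sample DataFrame (example values; Eve's age comes from the original output)
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'age': [25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# Filtering rows with a query expression
filtered_df = df.query('age > 30')
print(filtered_df)

Result:

      name  age
2  Charlie   35
3    David   40
4      Eve   45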

This query returns all rows where the age is greater than 30.

Filtering with isin

The isin method is useful for filtering data rows where the column value is in a predefined list of values. It’s especially useful for categorical data:

# Sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'department': ['HR', 'Finance', 'IT', 'HR', 'IT']}
df = pd.DataFrame(data)

# Filtering using isin
filtered_df = df[df['department'].isin(['HR', 'IT'])]
print(filtered_df)

Result:

      name department
0    Alice         HR
2  Charlie         IT
3    David         HR
4      Eve         IT

This example filters rows where the department column contains either ‘HR’ or ‘IT’.

These advanced querying techniques enhance the ability to perform targeted data analysis, allowing for the extraction of specific segments of data based on complex criteria.

Multi-Index Operations

Handling high-dimensional data often requires the use of multi-level indexing, or MultiIndex, which allows you to store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures like DataFrames. This chapter covers creating a MultiIndex and performing slicing operations on such structures.

Creating MultiIndex

MultiIndexing enhances data aggregation and grouping capabilities. It allows for more complex data manipulations and more sophisticated analysis:
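A minimal sketch of building a MultiIndex from two columns; the state labels follow the slicing example below, while the years and numbers are made-up:

import pandas as pd

# Sample data (example values)
data = {'state': ['CA', 'CA', 'NY', 'NY'],
        'year': [2022, 2023, 2022, 2023],
        'population': [39.0, 39.5, 19.6, 19.5]}

# Setting a two-level (state, year) MultiIndex
df = pd.DataFrame(data).set_index(['state', 'year'])
print(df)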

Slicing on MultiIndex

Slicing a DataFrame with a MultiIndex involves specifying the ranges for each level of the index, which can be done using the slice function or by specifying index values directly:
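Continuing the sketch above, two equivalent ways to slice on the first index level:

# Slicing with pd.IndexSlice
idx = pd.IndexSlice
sliced = df.loc[idx['CA':'NY', :], :]
print(sliced)

# Equivalent slicing using the built-in slice function
sliced_alt = df.loc[(slice('CA', 'NY'), slice(None)), :]
print(sliced_alt)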

This example demonstrates slicing the DataFrame to include data from states ‘CA’ through ‘NY’ at the first level of the index.

These MultiIndex operations are essential for working with complex data structures effectively, enabling more nuanced data retrieval and manipulation.

Data Merging Techniques

Merging data is a fundamental aspect of many data analysis tasks, especially when combining information from multiple sources. Pandas provides powerful functions to merge DataFrames in a manner similar to SQL joins. This chapter will cover four primary types of merges: outer, inner, left, and right joins.

Outer Join

An outer join returns all records when there is a match in either the left or right DataFrame. If there is no match, the missing side will contain NaN.

import pandas as pd

# Sample DataFrames
data1 = {'column': ['A', 'B', 'C'],
         'values1': [1, 2, 3]}
df1 = pd.DataFrame(data1)
data2 = {'column': ['B', 'C', 'D'],
         'values2': [4, 5, 6]}
df2 = pd.DataFrame(data2)

# Performing an outer join
outer_joined = pd.merge(df1, df2, on='column', how='outer')
print(outer_joined)

Result:

  column  values1  values2
0      A      1.0      NaN
1      B      2.0      4.0
2      C      3.0      5.0
3      D      NaN      6.0

Inner Join

An inner join returns records that have matching values in both DataFrames.

# Performing an inner join
inner_joined = pd.merge(df1, df2, on='column', how='inner')
print(inner_joined)

Left Join

A left join returns all records from the left DataFrame, and the matched records from the right DataFrame. The result is NaN on the right side where there is no match.

# Performing a left join
left_joined = pd.merge(df1, df2, on='column', how='left')
print(left_joined)

Result:

  column  values1  values2
0      A        1      NaN
1      B        2      4.0
2      C        3      5.0

Right Join

A right join returns all records from the right DataFrame, and the matched records from the left DataFrame. The result is NaN on the left side where there is no match.

# Performing a right join
right_joined = pd.merge(df1, df2, on='column', how='right')
print(right_joined)

These data merging techniques are crucial for combining data from different sources, allowing for more comprehensive analyses by creating a unified dataset from multiple disparate sources.

Dealing with Duplicates

Duplicate data can skew analysis and lead to incorrect conclusions, making it essential to identify and handle duplicates effectively. Pandas provides straightforward tools to find and remove duplicates in your datasets. This chapter will guide you through these processes.

Finding Duplicates

The duplicated() function returns a boolean series indicating whether each row is a duplicate of a row that appeared earlier in the DataFrame. Here’s how to use it:
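A minimal sketch, using made-up data in which rows 3 and 4 repeat earlier rows (consistent with the removal step described below):

import pandas as pd

# Sample DataFrame with duplicated rows (example values)
data = {'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
        'age': [25, 30, 35, 25, 30]}
df = pd.DataFrame(data)

# Boolean Series marking rows that duplicate an earlier row
print(df.duplicated())

Result:

0    False
1    False
2    False
3     True
4     True
dtype: bool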

In this output, True indicates that the row is a duplicate of an earlier row in the DataFrame.

Removing Duplicates

To remove the duplicate rows from the DataFrame, use the drop_duplicates() function. By default, this function keeps the first occurrence and removes subsequent duplicates.
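Continuing the sketch above:

# Removing duplicate rows, keeping the first occurrence of each
deduplicated_df = df.drop_duplicates()
print(deduplicated_df)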


This method has removed rows 3 and 4, which were duplicates of earlier rows. You can also customize this behavior with the keep parameter, which can be set to 'last' to keep the last occurrence instead of the first, or False to remove all duplicates entirely.

These techniques are essential for ensuring data quality, enabling accurate and reliable data analysis by maintaining only unique data entries in your DataFrame.

Custom Operations with Apply

The apply function in Pandas is highly versatile, allowing you to execute custom functions across an entire DataFrame or along a specified axis. This flexibility makes it indispensable for performing complex operations that are not directly supported by built-in methods. This chapter will demonstrate how to use apply for custom operations.

Custom Apply Functions

Using apply with a lambda function allows you to define inline functions to apply to each row or column of a DataFrame. Here is how you can use a custom function to process data row-wise:
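A minimal sketch; the body of custom_func and the column values are made-up, since only the column names ('col1', 'col2', 'result') are described below:

import pandas as pd

# Sample DataFrame (example values)
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# A hypothetical row-wise function combining 'col1' and 'col2'
def custom_func(row):
    return row['col1'] * row['col2']

# Applying the function to each row (axis=1) and storing the output in 'result'
df['result'] = df.apply(custom_func, axis=1)
print(df)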

In this example, the custom_func is applied to each row of the DataFrame using apply. The function calculates a new value based on columns ‘col1’ and ‘col2’ for each row, and the results are stored in a new column ‘result’.

This method of applying custom functions is powerful for data manipulation and transformation, allowing for operations that go beyond simple arithmetic or aggregation. It’s particularly useful when you need to perform operations that are specific to your data and not provided by Pandas’ built-in methods.

Integration with Matplotlib for Custom Plots

Visualizing data is a key step in data analysis, providing insights that are not apparent from raw data alone. Pandas integrates smoothly with Matplotlib, a popular plotting library in Python, to offer versatile options for data visualization. This chapter will show how to create custom plots using Pandas and Matplotlib.

Custom Plotting

Pandas’ plotting capabilities are built on Matplotlib, allowing for straightforward generation of various types of plots directly from DataFrame and Series objects.

Line Plot

Here’s how to create a simple line plot displaying trends over a series of values:

import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = {'Year': [2010, 2011, 2012, 2013, 2014],
        'Sales': [100, 150, 200, 250, 300]}
df = pd.DataFrame(data)

# Plotting
df.plot(x='Year', y='Sales', kind='line')
plt.title('Yearly Sales')
plt.ylabel('Sales')
plt.show()

Figure 1: Image generated by the provided code.

Histogram

Histograms are great for visualizing the distribution of numerical data:

# Sample data
data = {'Grades': [88, 92, 80, 89, 90, 78, 84, 76, 95, 92]}
df = pd.DataFrame(data)

# Plotting a histogram
df['Grades'] \
    .plot(kind='hist',
          bins=5,
          alpha=0.7)
plt.title('Distribution of Grades')
plt.xlabel('Grades')
plt.show()

Figure 2: Image generated by the provided code.

Scatter Plot

Scatter plots are used to observe relationships between variables:

# Sample data
data = {'Hours': [1, 2, 3, 4, 5],
        'Scores': [77, 78, 85, 93, 89]}
df = pd.DataFrame(data)

# Creating a scatter plot
df.plot(kind='scatter', x='Hours', y='Scores')
plt.title('Test Scores by Hours Studied')
plt.xlabel('Hours Studied')
plt.ylabel('Test Scores')
plt.show()

Bar Chart

Bar charts compare values across discrete categories; the 'Bars' and 'Values' columns below use made-up sample values:

# Sample data (example values)
data = {'Bars': ['A', 'B', 'C', 'D'],
        'Values': [10, 20, 15, 25]}
df = pd.DataFrame(data)

# Creating a bar chart
df.plot(kind='bar',
        x='Bars',
        y='Values',
        color='blue',
        legend=None)
plt.title('Bar Chart Example')
plt.ylabel('Values')
plt.show()

Figure 4: Image generated by the provided code.

These examples illustrate how to integrate Pandas with Matplotlib to create informative and visually appealing plots. This integration is vital for analyzing trends, distributions, relationships, and patterns in data effectively.

Advanced Grouping and Aggregation

Grouping and aggregating data are fundamental operations in data analysis, especially when dealing with large or complex datasets. Pandas offers advanced capabilities that allow for sophisticated grouping and aggregation strategies. This chapter explores some of these advanced techniques, including grouping by multiple columns, using multiple aggregation functions, and applying transformation functions.

Group by Multiple Columns

Grouping by multiple columns allows you to perform more detailed analysis. Here’s how to compute the mean of groups defined by multiple columns:
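A minimal sketch; the 'Department' and 'Revenue' names come from the discussion below, while 'Region' and the numbers are made-up:

import pandas as pd

# Sample sales data (example values)
data = {'Department': ['A', 'A', 'B', 'B'],
        'Region': ['North', 'South', 'North', 'South'],
        'Revenue': [100, 150, 200, 250]}
df = pd.DataFrame(data)

# Mean revenue for each Department/Region combination
grouped_means = df.groupby(['Department', 'Region'])['Revenue'].mean()
print(grouped_means)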

Aggregate with Multiple Functions
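Continuing the sketch above: several aggregation functions are applied to 'Revenue' at once, then a groupby transform normalizes 'Revenue' within each 'Department' as described below:

# Aggregating 'Revenue' with multiple functions per group
summary = df.groupby('Department').agg({'Revenue': ['mean', 'sum', 'std']})
print(summary)

# Normalizing 'Revenue' within each 'Department' (z-score via transform)
df['Revenue_zscore'] = df.groupby('Department')['Revenue'] \
                         .transform(lambda x: (x - x.mean()) / x.std())
print(df)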

The transform step shown above normalizes the ‘Revenue’ within each ‘Department’, showing deviations from the department mean in terms of standard deviations.

These advanced grouping and aggregation techniques provide powerful tools for breaking down complex data into meaningful summaries, enabling more nuanced analysis and insights.

Text Data Specific Operations

Text data often requires specific processing techniques to extract meaningful information or to reformat it for further analysis. Pandas provides a robust set of string operations that can be applied efficiently to Series and DataFrames. This chapter explores some essential operations for handling text data, including searching for substrings, splitting strings, and using regular expressions.

String Contains

The contains method allows you to filter rows based on whether a column’s text contains a specified substring. This is useful for subsetting data based on textual content:
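A minimal sketch, assuming a 'Description' column with made-up text (the same column name is used in the examples that follow):

import pandas as pd

# Sample text data (example values)
data = {'Description': ['Red apples from Spain',
                        'Fresh oranges',
                        'Bananas and apples']}
df = pd.DataFrame(data)

# Keeping only rows whose Description mentions 'apples'
contains_apples = df[df['Description'].str.contains('apples')]
print(contains_apples)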

String Split

Splitting strings into separate components can be essential for data cleaning and preparation. The split method splits each string in the Series/Index by the given delimiter and optionally expands to separate columns:

# Splitting the Description column into words
split_description = df['Description'].str.split(' ', expand=True)
print(split_description)

This splits the ‘Description’ column into separate columns for each word.

Regular Expression Extraction

Regular expressions are a powerful tool for extracting patterns from text. The extract method applies a regular expression pattern and extracts groups from the first match:

# Extracting the first word where it starts with a
# capital letter followed by lowercase letters
extracted_words = df['Description'].str.extract(r'([A-Z][a-z]+)')

This regular expression extracts the first word from each description, which starts with a capital letter and is followed by lowercase letters.

These text-specific operations in Pandas simplify the process of working with textual data, allowing for efficient and powerful string manipulation and analysis.

Working with JSON and XML

In today’s data-driven world, JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) are two of the most common formats used for storing and transferring data on the web. Pandas provides built-in functions to easily read these formats into DataFrames, facilitating the analysis of structured data. This chapter explains how to read JSON and XML files using Pandas.

Reading JSON

JSON is a lightweight format that is easy for humans to read and write, and easy for machines to parse and generate. Pandas can directly read JSON data into a DataFrame:

# Reading JSON data
df = pd.read_json('filename.json')
print(df)

This method will convert a JSON file into a DataFrame. The keys of the JSON object will correspond to column names, and the values will form the data entries for the rows.

Reading XML

XML is used for representing documents with a structured markup. It is more verbose than JSON but allows for a more structured hierarchy. Pandas can read XML data into a DataFrame, similar to how it reads JSON:

# Reading XML data
df = pd.read_xml('filename.xml')
print(df)

This will parse an XML file and create a DataFrame. The tags of the XML file will typically define the columns, and their respective content will be the data for the rows.

These functionalities allow for seamless integration of data from web sources and other systems that utilize JSON or XML for data interchange. By leveraging Pandas’ ability to work with these formats, analysts can focus more on analyzing the data rather than spending time on data preparation.

Advanced File Handling

Handling files with various configurations and formats is a common necessity in data analysis. Pandas provides extensive capabilities for reading from and writing to different file types with varying delimiters and formats. This chapter will explore reading CSV files with specific delimiters and writing DataFrames to JSON files.

Read CSV with Specific Delimiter

CSV files can come with different delimiters like commas (,), semicolons (;), or tabs (\t). Pandas allows you to specify the delimiter when reading these files, which is crucial for correctly parsing the data.

Reading CSV with Semicolon Delimiter
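A minimal sketch, assuming a hypothetical semicolon-delimited file:

# Reading a semicolon-delimited CSV file
df = pd.read_csv('filename.csv', delimiter=';')
print(df)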

Reading CSV with Tab Delimiter

If the CSV file uses tabs as delimiters, here’s how you might read it:
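A minimal sketch, assuming a hypothetical tab-delimited file:

# Reading a tab-delimited file
df = pd.read_csv('filename.tsv', delimiter='\t')
print(df)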

Writing to JSON

Writing data to JSON format can be useful for web applications and APIs. Here’s how to write a DataFrame to a JSON file:

# DataFrame to write to JSON
df.to_json('filename.json')

Assuming df contains the previous data, the JSON file filename.json would look like this:

{"Name":{"0":"Alice","1":"Bob","2":"Charlie"},"Age":{"0":30,"1":25,"2":35},"City":{"0":"New York","1":"Los Angeles","2":"Chicago"}}

This format is known as ‘column-oriented’ JSON. Pandas also supports other JSON orientations which can be specified using the orient parameter.

These advanced file handling techniques ensure that you can work with a wide range of file formats and configurations, facilitating data sharing and integration across different systems and applications.

Dealing with Missing Data

Missing data can significantly impact the results of your data analysis if not properly handled. Pandas provides several methods to deal with missing values, allowing you to either fill these gaps or make interpolations based on the existing data. This chapter explores methods like interpolation, forward filling, and backward filling.

Interpolate Missing Values

Interpolation is a method of estimating missing values by using other available data points. It is particularly useful in time series data, where it can estimate trends accurately:
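A minimal sketch, reusing the same example values as the forward-fill example below:

import pandas as pd
import numpy as np

# Sample DataFrame with missing values (example values)
df = pd.DataFrame({'value': [1, np.nan, np.nan, 4, 5]})

# Linearly interpolating the missing values
df['value'] = df['value'].interpolate()
print(df)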

Here, interpolate() linearly estimates the missing values between the existing numbers.

Forward Fill Missing Values

Forward fill (ffill) propagates the last observed non-null value forward until another non-null value is encountered:

# Sample DataFrame with missing values
data = {'value': [1, np.nan, np.nan, 4, 5]}
df = pd.DataFrame(data)

# Applying forward fill
df['value'] = df['value'].ffill()
print(df)

Result:

   value
0    1.0
1    1.0
2    1.0
3    4.0
4    5.0

Backward Fill Missing Values

Backward fill (bfill) propagates the next observed non-null value backwards until another non-null value is met:
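A minimal sketch, again with the same example values:

# Sample DataFrame with missing values (example values)
df = pd.DataFrame({'value': [1, np.nan, np.nan, 4, 5]})

# Applying backward fill
df['value'] = df['value'].bfill()
print(df)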

These methods provide you with flexible options for handling missing data based on the nature of your dataset and the specific requirements of your analysis. Correctly addressing missing data is crucial for maintaining the accuracy and reliability of your analytical results.

Data Reshaping

Data reshaping is a crucial aspect of data preparation that involves transforming data between wide format (with more columns) and long format (with more rows), depending on the needs of your analysis. This chapter demonstrates how to reshape data from wide to long formats and vice versa using Pandas.

Wide to Long Format

The wide_to_long function in Pandas is a powerful tool for transforming data from wide format to long format, which is often more amenable to analysis in Pandas:
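A minimal sketch; the column values are chosen to match the long-format example used later in this chapter:

import pandas as pd

# Sample wide-format data: one column per variable/year combination
wide_df = pd.DataFrame({'id': [1, 2],
                        'A2020': [100, 200], 'A2021': [150, 250],
                        'B2020': [300, 400], 'B2021': [350, 450]})

# Reshaping to long format: one row per (id, year), with columns A and B
long_df = pd.wide_to_long(wide_df, stubnames=['A', 'B'], i='id', j='year')
print(long_df)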

This output represents a DataFrame in long format where each row corresponds to a single year for each variable (A and B) and each id.

Long to Wide Format

Converting data from long to wide format involves creating a pivot table, which can simplify certain types of data analysis by displaying data with one variable per column and combinations of other variables per row:

# Assuming long_df is the DataFrame in long format from the previous example
# We will use a slight modification for clarity
long_data = {
    'id': [1, 1, 2, 2],
    'year': [2020, 2021, 2020, 2021],
    'A': [100, 150, 200, 250],
    'B': [300, 350, 400, 450]
}
long_df = pd.DataFrame(long_data)

# Transforming from long to wide format
wide_df = long_df.pivot(index='id', columns='year')
print(wide_df)

This result demonstrates a DataFrame in wide format where each id has associated values of A and B for each year spread across multiple columns.

Reshaping data effectively allows for easier analysis, particularly when dealing with panel data or time series that require operations across different dimensions.

Categorical Data Operations

Categorical data is common in many datasets involving categories or labels, such as survey responses, product types, or user roles. Efficient handling of such data can lead to significant performance improvements and ease of use in data manipulation and analysis. Pandas provides robust support for categorical data, including converting data types to categorical and specifying the order of categories.

Convert Column to Categorical

Converting a column to a categorical type can optimize memory usage and improve performance, especially for large datasets. Here’s how to convert a column to a categorical type:
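A minimal sketch; the category labels come from the output shown below, while the row order is made-up:

import pandas as pd

# Sample DataFrame (example values)
data = {'product': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple']}
df = pd.DataFrame(data)

# Converting the 'product' column to the 'category' dtype
df['product'] = df['product'].astype('category')
print(df['product'])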

Name: product, dtype: category
Categories (3, object): ['apple', 'banana', 'orange']

This shows that the ‘product’ column is now of type category with three categories.

Order Categories

Sometimes, the natural order of categories matters (e.g., in ordinal data such as ‘low’, ‘medium’, ‘high’). Pandas allows you to set and order categories:

# Sample DataFrame with unordered categorical data
data = {'size': ['medium', 'small', 'large', 'small', 'large', 'medium']}
df = pd.DataFrame(data)
df['size'] = df['size'].astype('category')

# Setting and ordering categories
df['size'] = df['size'].cat.set_categories(['small', 'medium', 'large'], ordered=True)
print(df['size'])

0    medium
1     small
2     large
3     small
4     large
5    medium
Name: size, dtype: category
Categories (3, object): ['small' < 'medium' < 'large']

This conversion and ordering process ensures that the ‘size’ column is not only categorical but also correctly ordered from ‘small’ to ‘large’.

These categorical data operations in Pandas facilitate the effective handling of nominal and ordinal data, enhancing both performance and the capacity for meaningful data analysis.

Advanced Indexing

Advanced indexing techniques in Pandas enhance data manipulation capabilities, allowing for more sophisticated data retrieval and modification operations. This chapter will focus on resetting indexes, setting multiple indexes, and slicing through MultiIndexes, which are crucial for handling complex datasets effectively.

Reset Index

Resetting the index of a DataFrame can be useful when the index needs to be treated as a regular column, or when you want to revert the index back to the default integer index:
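A minimal sketch; the state and population values are taken from the composite index shown later in this chapter:

import pandas as pd

# Sample DataFrame indexed by state
data = {'state': ['CA', 'NY', 'FL'],
        'population': [39500000, 19500000, 21400000]}
df = pd.DataFrame(data).set_index('state')

# Resetting the index; drop=True discards the old 'state' index entirely
reset_df = df.reset_index(drop=True)
print(reset_df)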

   population
0    39500000
1    19500000
2    21400000

Using drop=True removes the original index and just keeps the data columns.

Set Multiple Indexes

Setting multiple columns as an index can provide powerful ways to organize and select data, especially useful in panel data or hierarchical datasets:

# Re-using the previous DataFrame without resetting
df = pd.DataFrame(data)

# Setting multiple columns as an index
df.set_index(['state', 'population'], inplace=True)
print(df)

Result:

Empty DataFrame
Columns: []
Index: [(CA, 39500000), (NY, 19500000), (FL, 21400000)]

The DataFrame now uses a composite index made up of ‘state’ and ‘population’.

MultiIndex Slicing

Slicing data with a MultiIndex can be complex but powerful. The xs method (cross-section) is one of the most convenient ways to slice multi-level indexes:

# Assuming the DataFrame with a MultiIndex from the previous example
# Adding some values to demonstrate slicing (example values)
df['data'] = [1.1, 2.2, 3.3]

# Slicing with xs
slice_df = df.xs(key='CA', level='state')

This operation retrieves all rows associated with ‘CA’ from the ‘state’ level of the index, showing only the data for the population of California.

Advanced indexing techniques provide nuanced control over data access patterns in Pandas, enhancing data analysis and manipulation capabilities in a wide range of applications.

Efficient Computations

Efficient computation is key in handling large datasets or performing complex operations rapidly. Pandas includes features that leverage optimized code paths to speed up operations and reduce memory usage. This chapter discusses using eval() for arithmetic operations and the query() method for filtering, which are both designed to enhance performance.

Use of eval() for Efficient Operations

The eval() function in Pandas allows for the evaluation of string expressions using DataFrame columns, which can be significantly faster, especially for large DataFrames, as it avoids intermediate data copies:

import pandas as pd

# Sample DataFrame
data = {'col1': [1, 2, 3],
        'col2': [4, 5, 6]}
df = pd.DataFrame(data)

# Using eval() to perform efficient operations
df['col3'] = df.eval('col1 + col2')
print(df)

Result:

   col1  col2  col3
0     1     4     5
1     2     5     7
2     3     6     9

This example demonstrates how to add two columns using eval(), which can be faster than traditional methods for large datasets due to optimized computation.

Query Method for Filtering

The query() method allows you to filter DataFrame rows using an intuitive query string, which can be more readable and performant compared to traditional Boolean indexing:

# Sample DataFrame
data = {'col1': [10, 20, 30],
        'col2': [20, 15, 25]}
df = pd.DataFrame(data)

# Using query() to filter data
filtered_df = df.query('col1 < col2')
print(filtered_df)

Result:

   col1  col2
0    10    20

In this example, query() filters the DataFrame for rows where ‘col1’ is less than ‘col2’. This method can be especially efficient when working with large DataFrames, as it utilizes numexpr for fast evaluation of array expressions.

These methods enhance Pandas’ performance, making it a powerful tool for data analysis, particularly when working with large or complex datasets. Efficient computations ensure that resources are optimally used, speeding up data processing and analysis.

Advanced Data Merging

Combining datasets is a common requirement in data analysis. Beyond basic merges, Pandas offers advanced techniques similar to SQL operations and allows concatenation along different axes. This chapter explores SQL-like joins and various concatenation methods to effectively combine multiple datasets.

SQL-like Joins

SQL-like joins in Pandas are achieved using the merge function. This method is extremely versatile, allowing for inner, outer, left, and right joins. Here’s how to perform a left join, which includes all records from the left DataFrame and the matched records from the right DataFrame. If there is no match, the result is NaN on the side of the right DataFrame.

import pandas as pd

# Sample DataFrames
data1 = {'col': ['A', 'B', 'C'],
         'col1': [1, 2, 3]}
df1 = pd.DataFrame(data1)
data2 = {'col': ['B', 'C', 'D'],
         'col2': [4, 5, 6]}
df2 = pd.DataFrame(data2)

# Performing a left join
left_joined_df = pd.merge(df1, df2, how='left', on='col')
print(left_joined_df)

Result:

  col  col1  col2
0   A     1   NaN
1   B     2   4.0
2   C     3   5.0

This result shows that all entries from df1 are included, and where there are matching ‘col’ values in df2, the ‘col2’ values are also included.

Concatenating Along a Different Axis

Concatenation can be performed not just vertically (default axis=0), but also horizontally (axis=1). This is useful when you want to add new columns to an existing DataFrame:

# Concatenating df1 and df2 along axis 1
concatenated_df = pd.concat([df1, df2], axis=1)
print(concatenated_df)

Result:

  col  col1 col  col2
0   A     1   B     4
1   B     2   C     5
2   C     3   D     6

This result demonstrates that the DataFrames are concatenated side-by-side, aligning by index. Note that because the ‘col’ values do not match between df1 and df2, they appear disjointed, illustrating the importance of index alignment in such operations.

These advanced data merging techniques provide powerful tools for data integration, allowing for complex manipulations and combinations of datasets, much like you would accomplish using SQL in a database environment.

Data Quality Checks

Ensuring data quality is a critical step in any data analysis process. Data often comes with issues like missing values, incorrect formats, or outliers, which can significantly impact analysis results. Pandas provides tools to perform these checks efficiently. This chapter focuses on using assertions to validate data quality.

Assert Statement for Data Validation

The assert statement in Python is an effective way to ensure that certain conditions are met in your data. It is used to perform sanity checks and can halt the program if the assertion fails, which is helpful in identifying data quality issues early in the data processing pipeline.

Checking for Missing Values

One common check is to ensure that there are no missing values in your DataFrame. Here’s how you can use an assert statement to verify that there are no missing values across the entire DataFrame:

import pandas as pd
import numpy as np

# Sample DataFrame with possible missing values
data = {'col1': [1, 2, np.nan], 'col2': [4, np.nan, 6]}
df = pd.DataFrame(data)

# Assertion to check for missing values
try:
    assert df.notnull().all().all(), "There are missing values in the dataframe"
except AssertionError as e:
    print(e)

If the DataFrame contains missing values, the assertion fails, and the error message “There are missing values in the dataframe” is printed. If no missing values are present, the script continues without interruption.

This method of data validation helps in enforcing that data meets the expected quality standards before proceeding with further analysis, thus safeguarding against analysis based on faulty data.

Real-World Case Studies: Titanic Dataset

Description of the Data

This code loads the Titanic dataset directly from a publicly accessible URL into a Pandas DataFrame and prints the first few entries to get a preliminary view of the data and its structure. The info() function is then used to provide a concise summary of the DataFrame, detailing the non-null count and data type for each column. This summary is invaluable for quickly identifying any missing data and understanding the data types present in each column, setting the stage for further data manipulation and analysis.
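A minimal sketch of the loading step; the exact URL used in the book is not shown here, so the widely used seaborn-data copy of the Titanic CSV is assumed:

import pandas as pd

# Publicly accessible copy of the Titanic dataset (assumed URL)
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
titanic = pd.read_csv(url)

# Preliminary view of the data and a concise summary
print(titanic.head())
titanic.info()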

(Abridged output of titanic.info(); the last column listed is 'alone' with dtype bool, and the summary line reads:)
dtypes: bool(2), float64(2), int64(4), object(7)

Exploratory Data Analysis (EDA)

This section generates statistical summaries for numerical columns using describe(), which provides a quick overview of central tendencies, dispersion, and shape of the dataset’s distribution. Histograms and boxplots are plotted to visualize the distribution of and detect outliers in numerical data. The value_counts() method gives a count of unique values for categorical variables, which helps in understanding the distribution of categorical data. The pairplot() function from Seaborn shows pairwise relationships in the dataset, colored by the ‘survived’ column to see how variables correlate with survival.

import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics for numerical columns
print(titanic.describe())

# Distribution of key categorical features
print(titanic['survived'].value_counts())
print(titanic['pclass'].value_counts())
print(titanic['sex'].value_counts())

# Histograms for numerical columns
titanic.hist(bins=10, figsize=(10, 7))
plt.show()

# Boxplots to check for outliers
titanic.boxplot(column=['age', 'fare'])
plt.show()

# Pairplot to visualize the relationships between numerical variables
sns.pairplot(titanic.dropna(), hue='survived')
plt.show()


# Histograms for numerical columns
titanic.hist(bins=10, figsize=(10, 7))
plt.show()


Figure 1: Image generated by the provided code.

# Boxplots to check for outliers
titanic.boxplot(column=['age', 'fare'])
plt.show()


Figure 2: Image generated by the provided code.

# Pairplot to visualize the relationships between numerical variables
sns.pairplot(titanic.dropna(), hue='survived')
plt.show()


Figure 3: Image generated by the provided code.

Data Cleaning and Preparation

This code checks for missing values and handles them by filling with median values for age and the mode for embarked. It converts categorical data (sex) into a numerical format suitable for modeling. Columns that are not necessary for the analysis are then dropped.
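A minimal sketch of those steps; the specific columns dropped at the end are illustrative choices, and the numeric encoding is stored in a new column so the grouped analysis below can keep using the original 'sex' labels:

# Check for missing values
print(titanic.isnull().sum())

# Fill missing values: median for 'age', mode for 'embarked'
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

# Encode 'sex' numerically for modeling (kept in a separate column)
titanic['sex_code'] = titanic['sex'].map({'male': 0, 'female': 1})

# Drop columns not needed for this analysis (illustrative selection)
titanic = titanic.drop(columns=['deck', 'embark_town', 'alive'])
print(titanic.head())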

Survival Analysis

This segment examines survival rates by class and sex. It uses groupby() to segment data, followed by mean calculations to analyze survival rates. Results are visualized using bar plots to provide a clear visual comparison of survival rates across different groups.

# Group data by survival and class
survival_rate = titanic.groupby('pclass')['survived'].mean()
print(survival_rate)

pclass
1    0.629630
2    0.472826
3    0.242363
Name: survived, dtype: float64

# Survival rate by sex
survival_sex = titanic.groupby('sex')['survived'].mean()
print(survival_sex)

sex
female    0.742038
male      0.188908
Name: survived, dtype: float64

# Visualization of survival rates
sns.barplot(x='pclass', y='survived', data=titanic)
plt.title('Survival Rates by Class')
plt.show()

Figure 4: Image generated by the provided code.

sns.barplot(x='sex', y='survived', data=titanic)
plt.title('Survival Rates by Sex')
plt.show()

Figure 5: Image generated by the provided code.

Conclusions and Applications

The final section summarizes the key findings from the analysis, highlighting the influence of factors like sex and class on survival rates. It also discusses how the techniques applied can be used with other datasets to derive insights and support decision-making processes.

# Summary of findings
print("Key Findings from the Titanic Dataset:")
print("1. Higher survival rates were observed among females and upper-class passengers.")
print("2. Age and fare prices also appeared to influence survival chances.")

# Discussion on applications
print("These analysis techniques can be applied to other datasets to uncover underlying patterns and improve decision-making.")

Additional Resources

This section provides additional resources for readers to explore more about Pandas and data analysis. It includes links to official documentation and the Kaggle competition page for the Titanic dataset, which offers a platform for practicing and improving data analysis skills.

This comprehensive chapter outline and the accompanying code explanations give readers a thorough understanding of data analysis workflows using Pandas, from data loading to cleaning, analysis, and drawing conclusions.

# This section would list URLs or references to further reading
print("For more detailed tutorials on Pandas and data analysis, visit:")
print("- The official Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/")
print("- Kaggle's Titanic Competition for more explorations: https://www.kaggle.com/c/titanic")

This chapter provides a thorough walk-through using the Titanic dataset to demonstrate various data handling and analysis techniques with Pandas, offering practical insights and methods that can be applied to a wide range of data analysis scenarios.
