Assalamu alaikum. In this video I am going to present my machine-learning final project. Before starting, I would like to give some background and an overview of the project.

So let's start with the project overview. In 2000, Enron was one of the largest companies in the United States, but by 2002 it had collapsed into bankruptcy due to widespread corporate fraud. Can you imagine that? A company that large, bankrupt within two years; it was a huge impact. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives.

So in this project I will build a person-of-interest identifier, also known as a POI identifier, based on the financial and email data made public as a result of the Enron scandal. I will work with the data on 146 Enron executives from the fraud case. A person of interest is someone who was indicted for fraud, settled with the government, or testified in exchange for immunity. This report documents the machine-learning techniques used in building the POI identifier, so my main task is to identify the persons of interest.

There are four major steps, and I will go through them one by one so it is clear what is happening. First, the Enron dataset itself: I will look at the data and show how it is structured, because we cannot process the dataset before we understand it; in this case the dataset is very large, so I need some strategy to explore it. Second, feature processing: from this large dataset I will take the features I need and process them, which is a crucial part of implementing the third step, the algorithms. There we will select a few algorithms and see how they work on the dataset. Finally, once the algorithms are implemented, we will validate them and compare their performance to see which one gives the best results.

Now let's look at the dataset. Before importing it, I want to mention the tools I used, which were mainly two: Python and scikit-learn. From scikit-learn I imported packages such as GridSearchCV, and I also used numpy, matplotlib, and pickle. If we print the length of the dataset, we see that there are indeed 146 executives. The dataset is actually a Python dictionary, structured as key-value pairs. If we print out the keys, we see all the names: each key is a name formatted as last name, then first name, then a middle initial when the person has one. If we print out the entry for one individual, for example BUY RICHARD B, we see his features and their values from the Enron data.
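To make that structure concrete, here is a miniature stand-in for the dataset dictionary just described. The key format matches the real dataset, but this is a three-entry toy dict and the numbers are illustrative placeholders, not the real figures:

```python
# Toy stand-in for the Enron dataset dictionary (the real one has 146
# entries). Keys are "LAST FIRST [MIDDLE INITIAL]" strings; values are
# dicts of financial and email features. Numbers here are illustrative.
data_dict = {
    "SKILLING JEFFREY K": {"salary": 1100000, "bonus": 5600000, "poi": True},
    "LAY KENNETH L":      {"salary": 1070000, "bonus": 7000000, "poi": True},
    "BUY RICHARD B":      {"salary": 330000,  "bonus": 900000,  "poi": False},
}

print(len(data_dict))              # how many people are in the dict
print(sorted(data_dict))           # the key names
print(data_dict["BUY RICHARD B"])  # one individual's feature dict
```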
He has a salary, his to_messages count, deferral payments, total payments, and many more features; I have not listed everything here because there was not enough space. So that is how the Enron dataset is structured.

Now we process the dataset. I will do mainly two things: take care of outliers and check for missing values. To check for outliers, I take two attributes, salary and bonus, plot them for all employees, and look for anomalies. When I plot them, I can see one clear outlier; when I check its value, it turns out to be the TOTAL row, the sum of everyone's salary and bonus. That is not sensible information for the analysis, so I can remove it manually. After that I still get two more outliers, SKILLING JEFFREY and LAY KENNETH, but I have to keep those names in the dataset because their values are real; these two men were suspects in the fraud case, so I cannot remove them. So first we removed the invalid outlier by popping it out of the dictionary, and then we removed the null values: we go through the data from top to bottom and drop any "NaN" values we find. After all of this processing, if we plot the dataset again, we finally get a clean plot. Now I will show you how I did the feature processing, which is a really crucial stage in my project.
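The two cleaning steps above can be sketched as follows. This uses a toy dictionary in the dataset's style (missing values are encoded as the string "NaN"); the numbers, and the "EMPLOYEE SOME" entry, are made up for illustration:

```python
# Toy version of the outlier and missing-value cleaning described above.
data_dict = {
    "TOTAL":              {"salary": 26704229, "bonus": 97343619},  # spreadsheet sum row
    "SKILLING JEFFREY K": {"salary": 1100000,  "bonus": 5600000},   # real outlier: keep
    "LAY KENNETH L":      {"salary": 1070000,  "bonus": 7000000},   # real outlier: keep
    "EMPLOYEE SOME":      {"salary": "NaN",    "bonus": "NaN"},     # missing values
}

# Step 1: pop the TOTAL aggregation row -- it is not a person.
data_dict.pop("TOTAL", None)

# Step 2: keep only (salary, bonus) points where both values are present.
points = [(d["salary"], d["bonus"]) for d in data_dict.values()
          if d["salary"] != "NaN" and d["bonus"] != "NaN"]
print(points)  # the remaining points, ready to re-plot
```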

After cleaning the outliers out of the dataset, I had to pick the most sensible features to use, because there are many features and not all of them are necessary for my project; some will help identify POIs correctly, while others will not. First I took the two features I expected to be most important for finding a pattern: messages from a POI to this person, and messages from this person to a POI. But when I plotted these two features directly, I did not get a strong pattern. So instead of the raw counts I took fractions of both features: the fraction of messages received from POIs out of all messages received, and the fraction of messages sent to POIs out of all messages sent. These are the two engineered features I used. If I plot the data with them, the blue points are non-POIs and the red points are persons of interest.

But how do I rank my features? How do I know these features are the best choice? For that I used a decision tree, which, as we were taught in class, is very useful for selecting relevant features: it gives a rank to each attribute, and from there I can see which attributes give the best results. So I first imported the decision tree classifier, implemented the algorithm through an iterative process, and got a good result: an accuracy of about 0.8, along with the decision tree's running time. But most important was the feature ranking. The features were ranked by my decision tree with their importance values, sorted from most to least important: salary first, then bonus, and so on. From this ranking I first took the top ten features.
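The two engineered fractions just described can be computed as below. The helper name `compute_fraction` and the message counts are made-up examples, not values from the real dataset:

```python
def compute_fraction(poi_messages, all_messages):
    """Fraction of a person's messages exchanged with POIs; 0 when data is missing."""
    if poi_messages in ("NaN", 0) or all_messages in ("NaN", 0):
        return 0.0
    return poi_messages / float(all_messages)

# Made-up email counts for one person.
person = {"from_poi_to_this_person": 47, "to_messages": 2000,
          "from_this_person_to_poi": 30, "from_messages": 150}

fraction_from_poi = compute_fraction(person["from_poi_to_this_person"],
                                     person["to_messages"])
fraction_to_poi = compute_fraction(person["from_this_person_to_poi"],
                                   person["from_messages"])
print(fraction_from_poi, fraction_to_poi)  # roughly 0.0235 and 0.2
```

Using fractions instead of raw counts removes the effect of how prolific an emailer each person was, which is why the pattern becomes visible.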
So I picked those ten features out of the original sixteen, and with this feature set the accuracy was around 0.8. But another problem occurred: with these features, precision and recall were too low. I was intending to use precision and recall, but both were below 0.3, so I had to change my strategy and manually pick features that gave me precision and recall values over 0.3. Another important thing I forgot to mention earlier: on this dataset I cannot really use accuracy to evaluate my algorithm, because there are very few POIs, just 18, so the best evaluators are precision and recall. When I researched this, I found that precision and recall are the right metrics for a dataset like this one: there are only 18 POI examples in the dataset, even though there were actually 35 people who were POIs in real life; for various reasons, about half of them are not present in the dataset.

After all of this research and experimentation, I finally picked the following three features: the fraction of emails from this person to POIs, the fraction of emails to this person from POIs, and shared receipt with POIs. With these three features I get a pattern that can help my POI classifier.

Now we come to algorithm selection and tuning. First we try naive Bayes as the predictor, importing the GaussianNB package. When I tested naive Bayes, its accuracy was about 0.83, lower than the decision tree's 0.9, so I concluded that the feature set I am using does not suit naive Bayes well enough. Finally, I selected the decision tree.
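The algorithm comparison described above can be sketched like this. Since the real Enron features are not bundled here, this uses a synthetic imbalanced three-feature dataset as a stand-in, so the accuracies it prints will not match the project's 0.83 and 0.9:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the 3-feature POI data
# (about 15% positives, mimicking the rarity of POIs).
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit both candidates and compare plain accuracy, as in the project.
for clf in (GaussianNB(), DecisionTreeClassifier(random_state=42)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, round(clf.score(X_test, y_test), 3))
```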
For the POI identifier, the decision tree gave me an accuracy of 0.9 before tuning the parameters. No feature scaling was deployed either, and by the way that is not a hazard: when using a decision tree, you do not need feature scaling. After selecting the algorithm, I manually tuned the min_samples_split parameter; I had to do it manually to optimize for precision and recall, and it turned out that 6 and 5 were the best values for min_samples_split. That was the tuning I did for my decision tree algorithm.

After testing the algorithm, we analyze the validation and performance. The model was validated using 3-fold cross-validation together with precision and recall scores. At first I was using accuracy to evaluate the algorithm, but that was a mistake, because the dataset has a class-imbalance problem: the number of POIs is small compared to the total number of examples. Since I have very few persons of interest in my dataset compared to everyone else, I had to use precision and recall, and with these metrics I was able to reach a precision of 0.68 and a recall of 0.8.

Let's go through the code. The features list, as I showed you, contains poi (the label) plus the three features: the fraction of emails from POIs, the fraction of emails to POIs, and shared receipt with POIs; four entries in all. Then we have the whole data dictionary, and the machine learning goes from there: feature selection, then the split and validation, then the algorithms, and finally our decision tree classifier. We print the accuracy before tuning and after tuning.
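The tuning-plus-validation loop just described can be sketched as below. Again the data is a synthetic imbalanced stand-in, so the printed scores are illustrative, not the project's 0.68 / 0.8:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the 3-feature POI data.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.85], random_state=42)

# Score a few candidate min_samples_split values with 3-fold
# cross-validation, using precision and recall rather than accuracy
# (accuracy is misleading under class imbalance).
for mss in (2, 5, 6, 10):
    clf = DecisionTreeClassifier(min_samples_split=mss, random_state=42)
    prec = cross_val_score(clf, X, y, cv=3, scoring="precision").mean()
    rec = cross_val_score(clf, X, y, cv=3, scoring="recall").mean()
    print(f"min_samples_split={mss}: precision={prec:.2f} recall={rec:.2f}")
```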
If we go to the output, we see that our accuracy before tuning was 0.8666, with a decision tree running time of 0.001 seconds.

After tuning, the performance increased: the accuracy became 0.933, the precision was 0.68, and the recall was 0.8. In this way we tuned our algorithm; we used the decision tree classifier, and that was the validation and performance of my process.

We have now come to the final part, where I will discuss and conclude the results. After all the work, my final findings are the precision and recall scores. The precision can be interpreted as the likelihood that a person identified as a POI is actually a true POI; from the precision we know how reliably our machine-learning algorithm flags real POIs. The fact that it is 0.68 means that using the identifier to flag POIs results in 32% of the positive flags being false alarms. Then we have the recall of 0.8, which describes how often the identifier flags a POI in the test set: 80 percent of the time it would catch that person, and 20 percent of the time it would not. These numbers are quite good, but the strategy can still be improved. One possible path to improvement is digging into the email data more: the email features in this dataset were aggregated over all the messages for a given person, so by digging into the text of each individual message, it is possible that more detailed patterns can be found. Since we live in a world where more financial data might not be easy to find, the next realistic step is to try to extract more data from the emails; we need to dig deeper into the email text, and from there maybe we will find more patterns to identify the real culprits.

So that was my whole project. Thanks for watching the whole thing; I hope that I could make it understandable for everyone.
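As a quick sanity check, the interpretation of the two final scores above reduces to simple arithmetic:

```python
precision, recall = 0.68, 0.8  # final scores reported above

# Of everyone the identifier flags as a POI, (1 - precision) are false alarms.
false_alarm_rate = 1 - precision   # 32% of positive flags are wrong
# Of all true POIs, the identifier catches `recall` and misses the rest.
miss_rate = 1 - recall             # 20% of real POIs slip through
print(f"false alarms: {false_alarm_rate:.0%}, missed POIs: {miss_rate:.0%}")
```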
Since this was a video project, maybe there are some parts I could not make fully clear. Anyway, I wanted to say that this project was really a challenge for me, but I managed to implement the concepts that were taught in our course: I managed to preprocess the data and the features, implement machine-learning algorithms such as naive Bayes and decision trees, tune an algorithm according to my requirements, and finally validate the performance using cross-validation with precision and recall. Finally, I would like to thank our teacher for teaching us machine learning; this was a very beneficial project for putting those concepts into practice, and from here I can extend my knowledge further. Thank you.