Hello and welcome to the OMGenomics show. I'm Maria Nattestad. Today I'm having a guest on the show for the very first time: this is Robert Aboukhalil. Thank you so much for being with us today.

Thank you for having me. I'm a big fan of the show.

Can you say a little bit about your background to start us off?

Sure. I did my bachelor's in computer engineering at McGill, and after that I did a PhD in bioinformatics at Cold Spring Harbor Laboratory.

I'm very familiar with that place, yes. Bioinformatics, that's awesome. So what are you doing right now?

Right now I'm a bioinformatics software engineer, so my work focuses on building cloud applications that enable biologists to run complex analyses in an automated fashion and give them visualization tools to help them explore the data.

Cool. So what did you do during your PhD? Do you remember the title of your thesis?

Oh boy. It was something like "Elucidating cancer evolution via single-cell sequencing."

Oh, that's also really impressive.

Yeah, it always sounds a little pretentious when we say our thesis titles.

I like that. Did that prepare you well for a bioinformatics software engineering position?

Yeah. Interestingly enough, during my PhD I started building web applications, to actually channel some of the algorithms and software tools I had built into a web app and disseminate them to a general audience.

That's awesome. Did you find that had an impact on how many people were using your tools?

Oh, absolutely. I found that, for example, when we go to conferences and present our work, people would often come up afterwards and say, "I've had this data lying around for months, waiting for collaborators to finally have some time to analyze it, and now I can actually look at the data myself." Oftentimes they'll say this is a very empowering tool to have.

So they actually use the word "empowering"?

Yes.

How does it feel to empower other scientists with your work? It's pretty powerful.

It is. I think it's the best feeling I've had
during grad school, which is partly why I decided to continue down that route of building usable software for bioinformatics.

Right, I really like that, because I had a very similar experience, partly because I knew you and we talked a lot during our PhDs. Your work inspired me in a way that also helped me get started with making web applications, and then of course I got really into visualization tools, and that's now what I'm doing with the OMGenomics company: making a lot more tools for biologists. So thank you for that.

Oh, it's nothing. I mean, you're now in a league of your own when it comes to visualization.

That's so nice of you. But I feel like we've also kind of diverged. I've gone more into the visualization part of it, because in my own research the biggest gap for me was understanding the data to its core: understanding the variation, the copy number, and how they come together. That was where I felt my research wasn't reaching its full potential, because I couldn't quite understand everything that was going on in the cell line we were sequencing, so I started going into that a little more. But I think you've gone more into cloud computing now, especially since joining industry.

Yeah, my focus has shifted more towards how you build scalable cloud applications to optimize bioinformatics analysis.

Right, that's really cool. I'd like you to say a few words about the cloud, since you have so much experience with it. There are probably a lot of people who would treat the cloud exactly the way they treat their institution's cluster: keep it on all the time and just go on there and run things when they need to, but it's always on. What would you say to that?

In my experience that's the wrong approach to take. I think you have to think of the cloud as an on-demand service provider, so the best way, or the cheapest way, to use
the cloud is for on-demand kinds of analyses: here and there you need to launch an analysis, and once it's done you just stop using the cloud and shut it down.

Right.

And that really takes advantage of the reason the cloud is so powerful, which is that you can scale to a much greater level than you can if you do this in-house, and you can scale in a very short amount of time. And then you don't have to pay for maintenance, you don't have to pay for cooling, and you don't have to hire an IT department to manage it.

Right. So when we say the cloud here, we mean Amazon's AWS, which I believe stands for Amazon Web Services, and Google Cloud. Those would be cloud providers. There's also Microsoft Azure; I don't know if that's how you pronounce it.

I actually have no idea. Maybe an Azure specialist can let us know in the comments.

All right, let's talk about today's topic, which is whether biologists should use Excel to do some of their data analysis, what some of the advantages and pitfalls are, and when you should switch to doing something else. So what would you say about whether biologists should use Excel in their research?

It's funny, because when I was preparing for this I thought back to a year ago. I think if we had done this interview a year ago, I would have said you should never use Excel in biology. I don't feel that way anymore; in fact, slightly the opposite. Excel has its drawbacks, and we can talk about those, but I feel like for the most part Excel is a very powerful tool.

And telling people "you should just use R or Python instead" overlooks the fact that learning R or Python is not an overnight thing. There's a learning curve.

Right, that's kind of jumping over that, like you're assuming that they already know how to use R.

Exactly.

So tell me, a year ago, what was the thinking that you had? I see some of that in the community as a bit of a stigma, where some bioinformaticians will say that if you're doing all of your data analysis in Excel then it's not really bioinformatics, so you should never use Excel. Where do you think that comes from?

Right. I think this "no true Scotsman" argument comes from the fact that the people who say it already know R and Python, and so to them not using Excel is super easy.

What do you mean by "no true Scotsman"? I'm not familiar with that phrase.

Oh, it's essentially an argument style where you say, "this is not true bioinformatics because you don't do this." So basically, you're not a true bioinformatician if you still use Excel. We see similar arguments in other fields, like "you're not a true developer if you don't use this or that tool," which is obviously nonsensical.

Right. We should always be using the tools that make sense for the research. We're always just trying to accomplish something.

Being pragmatic is always the best idea, I think.

So there are still some pitfalls with Excel, and we'll talk about one of those. But first, at what point do you think Excel is good enough, and when might you need to switch over to something like R in order to continue your analysis?

Right. I think Excel is perfectly fine if you're dealing with data that doesn't crash Excel, especially if you're collecting data, or you've just received, let's say, a matrix of gene expression and you just want to plot a few quick things. I think that's fine for initial data exploration.

Right. And for collecting data, definitely. Think of people in the wet lab who may be doing a time
trial and entering data in Excel every time they take a measurement. Of course you don't need to jump over to R in order to do that. You can note your data down in some format that makes sense, and then you can export it from Excel into R, actually very easily.

Right, that's a really good point. So at what point do you feel like you have to switch over to R or Python or something?

Right. I feel like it's when the data gets too large, or if you're doing some complex analysis that Excel doesn't support. If you're doing some very complex statistical analysis, obviously you shouldn't be doing it in Excel. But quite honestly, I think Excel probably covers more than eighty percent of a biologist's needs, unless they're getting into NGS pipelines.

Right, that's a whole other piece. So the problem that I want to talk about with Excel is that some gene names, like OCT4, get converted to dates: OCT4 becomes October 4, and DEC1 becomes December 1. This doesn't happen to all of your genes, so you might miss it because it's in the middle of the list. In the human genome you have twenty thousand to one hundred thousand genes, depending on how you look at it, so this could very well be happening in your data. What are some of the problems with that, and how can you solve it?

Yeah, so one of the biggest problems is that a lot of your gene names are now converted to dates, and as soon as you click the Save button they're overwritten. If you do any kind of downstream analysis, you've lost OCT4, or you've lost DEC1, and these genes are not going to show up, because your downstream analysis is probably not going to recognize them if it uses any kind of database.

Right, so if you're matching against a database from the UCSC Genome Browser or something like that, then the gene names will no longer match, and you will completely lose whether OCT4 was differentially expressed.

Yes, which could be very important.

Right. So how do you deal with that
issue?

Right. It's funny, because if you look at the Excel manual, they'll say that there's no way to completely turn off this feature.

Yes, it's a feature, it's not a bug.

But there are some workarounds. One of the workarounds they suggest is to edit every single gene in your text file and add a single quote before it. That way Excel is going to leave it alone and not convert it.

So it's not a quote on both sides, it's just a single quote, like an apostrophe?

Yeah, right before. It's unclear to me why that is.

Yes.

Yeah, this is one of the things they suggest on their help page. Now, there's a slightly better way of doing this that doesn't involve editing every single line.

Good.

And it's to never double-click on a CSV or TSV file and have it open in Excel, because then the conversion is done already. Instead, you want to go to File, Import, and it'll give you a wizard. When you get to the step where it shows you all the columns in your data, you need to select the column that contains the genes and select Text instead of General. A General column will get converted to dates if it looks like a date, but Text will just stay as is.

And it doesn't have to be that the whole column looks like dates for it to do that; it will do it for each individual cell.

Yes, which is why it's so dangerous, because it doesn't check whether the rest of the column looks like dates.

Yeah, that's pretty bad. So you can do it by adding a single quote before every gene in an editor, and I recommend Sublime Text for that. If you're doing that kind of editing, you can highlight all of your lines and get the same cursor on all of them at once, which is really cool, if you hold down, I think, Option. That's one way to do it. I like your way of doing it in the wizard in Excel, but I think you mentioned to me that you also have a little application now that helps you do that?

Yeah, it's called Oct4.

Yes, and I like that the name highlights the issue it's meant to solve.

So
essentially, it allows you to select a bunch of CSV and TSV files, and it converts them into Excel files without doing any kind of funky conversion into dates.

Oh cool, so that takes care of it. Okay, where can we find that application? I can put up

a link here.

Yeah, so you can find it on Gumroad. We'll put up a link to the app, and it's totally free, so you can get it today.

Oh, awesome. So we'll put up a link, maybe right around here, and you guys can go check that out. Before we wrap up, is there anything else you want to talk about? And then we can give some links to your website.

No, I think that covers it. I think that in general, telling biologists "don't use Excel, just use R or Python" is bad advice. We should be telling them, "Look, Excel is good for a lot of things; just be aware of these issues, and here's how you fix them."

Mmm, that's very true. And understanding, maybe, what kinds of statistical tests Excel can't do.

Yeah, Google is your friend. You just search "can I do this in Excel?"

Yeah. And when you get to something that you can't do, instead of saying, "Okay, then maybe I won't do that for my research; maybe this project doesn't need a t-test," I don't know, you should probably go and find another way to do it. For people who've been using Excel and haven't done anything else, I would recommend R as your second step for when Excel falls a little short. But see whether you can use Excel the whole way through, and when you run into a problem, that's when you switch.

Okay. Don't agree with people who speak in absolutes, like "you're not a true bioinformatician" or "you should never use Excel for doing biology research." It just doesn't make sense. We always have to be pragmatic.

Yeah, that's one of my moral codes: pragmatism.

All right, awesome. So where can people find out more about you and get in touch with you on social media?

Oh, you can find me on my website, robertaboukhalil.com, and I'm also on Twitter at @RobAboukhalil.

Awesome. All right, thank you so much for being on the show, and we'll wrap up here. I want to say to all of you viewers that you can sign up for email updates from OMGenomics at omgenomics.com/subscribe, and that way I'll send you weekly
updates when new videos like this one come out. I hope to do some more interviews in the future, because I think this was really good. You can also subscribe on YouTube, which also helps other people find these videos, so we can spread this kind of knowledge around the community. I think it's important. So thank you so much for watching, and I'll see you next time on the OMGenomics show.
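
As a rough illustration of the gene-name problem discussed in the interview, the Python sketch below flags entries in a gene list that look like Excel-converted dates, and applies the single-quote workaround that was mentioned. It assumes the converted names were saved in a day-month form such as `4-Oct` or `1-Dec`, which may vary with Excel version and locale. The function names `find_mangled_genes` and `quote_for_excel` are hypothetical and are not part of the Oct4 app or of any library.

```python
import re

# Assumption: Excel wrote converted gene names as "<day>-<month abbreviation>",
# e.g. OCT4 -> "4-Oct", DEC1 -> "1-Dec". Other locales may use other forms.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({MONTHS})$", re.IGNORECASE)


def find_mangled_genes(gene_names):
    """Return (index, value) pairs for entries that look like converted dates."""
    return [
        (i, name)
        for i, name in enumerate(gene_names)
        if DATE_LIKE.match(name.strip())
    ]


def quote_for_excel(gene_names):
    """Prefix each name with a single quote, as in the workaround described above.

    Names that already start with a single quote are left unchanged.
    """
    return [name if name.startswith("'") else "'" + name for name in gene_names]


if __name__ == "__main__":
    genes = ["TP53", "4-Oct", "BRCA1", "1-Dec", "OCT4"]
    print(find_mangled_genes(genes))
    print(quote_for_excel(["OCT4", "DEC1"]))
```

A check like this only detects the damage after the fact. It cannot tell which original gene a date came from with certainty, so importing the gene column as Text, as described in the interview, remains the safer route.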