Katharine Jarmul: Data wrangling with Python

NEW SPEAKER: So the second speaker of today’s section is Katharine Jarmul, who works on customer data analysis using Python, and today she is talking about data wrangling with Python.

KATHARINE JARMUL: Hello everybody, nice to be here. I have prepared a talk that is probably, after seeing the last hour or so of talks, a little bit novice for this audience. However, hopefully you can use something from it, and if not then feel free to use my slides and tell your aunt or cousin or mother how to do data analysis with Python. And if you want to talk about something more advanced, we can hopefully chat later.

I am Katharine Jarmul; I am @kjam on most tech things. I am originally from Los Angeles and I live in Berlin. I hope everyone is familiar with PyLadies? Yeah? The original chapter was in Los Angeles and it’s really exciting to see now how much it’s grown; it’s really amazing and makes me feel very warm about the Python community. I’ve been coding Python since 2008. I started with Django at the Washington Post when there were still ugly Adrian hats in the {inaudible} - it was a good time, yeah, we can talk about some of the caching. I am self- and mentor-taught, so I really hope that those of you who are new to Python can find and make some connections here today and throughout this next week, or that you’ve already made those, because I definitely wouldn’t be where I am today if it wasn’t for the mentors that helped me.

So, a little intro to what exactly I mean by data wrangling.

So, it’s basically the ability to analyse and say something with data. Everybody here has probably done way more advanced data analysis than this, but the really good news is that you can run any type of report using Python. Some of what I do currently is running marketing reports, user analysis and site visitor analysis, but you can also do sports analysis with statistics, and {inaudible} quite a large open data {inaudible} a great tool to use.

Why use Python? It’s a scripting language with some real power. There is the advanced scientific stack, of course, which I hope some of us are familiar with. It is a really friendly community; it’s really easy to ask questions and get help. And of course it is named after Monty Python, and here in the UK that makes it obviously superior to all other languages.

So why do I even care about Python for data analysis? The way I approach it, especially when I’m talking with people who don’t know code at all, is: do you ever have rote, boring, awful tasks, and would you rather never do them again? Then learn Python - that’s what I do. Sadly it cannot fold your laundry yet, but we’ll see if we can get that working. Python allows you to have statistical power without necessarily becoming a statistician. You can easily automate things and you can never, ever use Excel again - and hopefully some of us recognise a nice Welsh dragon burning Microsoft Excel!

So the first step of any data wrangling is getting hold of data that’s interesting for you to use. You probably already have some: there is probably something you do regularly, whether it’s logging into a utility billing site or whatever it might be - you can use that data. Another great source is the very large-scale open data movement; data.gov.uk has quite a lot of good data sets. And of course Python supports any number of formats - CSV, Excel, PDF, XML, JSON, Google Docs - and I’m sure many of you are aware of which of those are supported natively. For those not supported natively there are lots of really useful tools. I recently had the pleasure of using gspread, which is a Google spreadsheet reader built on the Google Spreadsheets API, and it’s quite intuitive and easy to use; xlrd is my favourite if you have to handle Excel files for people in your company, and PDFMiner is something I’ve been recently working with to mine PDF documents.
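As a rough illustration of how little code those formats need, here is a minimal sketch - the file names are invented for the example, and xlrd here reads a classic .xls workbook:

    import csv
    import json

    import xlrd  # third-party package for reading Excel files

    # CSV and JSON readers ship with the standard library.
    with open('energy_bills.csv') as f:       # hypothetical file
        rows = list(csv.DictReader(f))

    with open('energy_bills.json') as f:      # hypothetical file
        records = json.load(f)

    # xlrd: open a workbook and walk the first sheet row by row.
    book = xlrd.open_workbook('energy_bills.xls')
    sheet = book.sheet_by_index(0)
    for i in range(sheet.nrows):
        print(sheet.row_values(i))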

So, databasing. For a quick start, especially if, let’s say, you’re working with people on your team who don’t know SQL yet, I would recommend dataset. It’s developed by a guy whose handle, so to speak, is pudo; he’s part of the OpenNews community and lives in Berlin, and we’re grabbing beers next week - I’m helping work on some of the bugs, so if you have any issues with it, send me stuff. There are also relational databases, {inaudible} et cetera, and non-relational databases, so you can use MongoDB, CouchDB ...
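For reference, this is roughly what the dataset library looks like in use - a quick sketch with an invented SQLite file and table:

    import dataset

    # Connect to (and create, if needed) a local SQLite database.
    db = dataset.connect('sqlite:///talks.db')   # hypothetical file

    table = db['attendees']                      # tables are created lazily
    table.insert(dict(name='Katharine', city='Berlin'))
    table.insert(dict(name='Daniele', city='Cardiff'))

    # No SQL required for simple queries.
    for row in table.find(city='Berlin'):
        print(row['name'])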

So APIs - application programming interfaces - are a great place to get data, and you can connect Python directly with Twitter, Instagram, Facebook and tons of other data sets. Let’s have a look. Yesterday I had the pleasure of sitting in Cardiff and scrolling through Twitter, and I wanted to search for {inaudible} - did I butcher that? Who speaks Welsh? Caerdydd ... if my family were here they would be very not proud of me. I just wanted to see what people were tweeting about - a gorgeous day yesterday in Cardiff - so you can see that we have some tweets that we were able to get with about 10 lines of Python code.
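Those 10 lines might look something like this - a sketch using the tweepy library, with placeholder credentials; note that the exact method names vary by tweepy version (in newer releases api.search was renamed search_tweets):

    import tweepy

    # Placeholder credentials from a registered Twitter app.
    auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
    auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
    api = tweepy.API(auth)

    # Search recent tweets mentioning Cardiff (in Welsh).
    for tweet in api.search(q='Caerdydd', count=10):
        print(tweet.user.screen_name, ':', tweet.text)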

Other APIs that you can use: Google Analytics, Google AdWords. These are really useful if you have customer data that you need to integrate, say, with a back end - say you need to pull all this analytics data in, integrate it with the back end, and ask how long return customers are spending on the site, et cetera; that can give you more information. Also, there are plenty of open government APIs - I’ve actually been interfacing quite recently with a lot of the open Africa data, looking at some of the conflict mining stories there - so there are plenty of data sets there, plus translation APIs if you need to do anything like that, stock market APIs, recipe APIs - a million APIs to interface with.

So, a little bit on web scraping. If there is not an API and you need to access the data, you can build your own API, so to speak, with a web scraper. Python is uniquely situated in that it is a scripting language and therefore gives you really easy access to read something like an HTML or XML document, and then allows you to use that information really quickly in your analysis. I don’t think that’s something every language community has yet, so that’s a really nice reason to know Python.

So if you want to take a peek at a page, I recommend lxml - it’s tremendously fast and has great syntax. If you actually want to click around and use things you can of course use Selenium, but something I’ve been using recently is {inaudible} interacting with GhostDriver, which is a little bit better, I think, in some ways. And then if you really, really need the whole site you can use something like Scrapy; Scrapy is tremendously fast and useful and is worked on by a great team of developers.
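A minimal lxml sketch - the schedule URL and the XPath selector are made up for the example, and real pages will need their own selectors:

    import requests
    from lxml import html

    # Fetch the page and parse it into an element tree.
    page = requests.get('https://2015.djangocon.eu/talks/')  # hypothetical URL
    tree = html.fromstring(page.content)

    # Pull out talk titles with an XPath query (selector is a guess).
    titles = tree.xpath('//h3[@class="talk-title"]/text()')
    for title in titles:
        print(title.strip())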

OK, so I took some time to scrape the Django talks for this programme - simple stuff that scrapes the content and returns what’s on the page {inaudible}. OK, big deal: everybody here can read a website, or use a browser that allows them to read a website. But imagine if you couldn’t. For example, right now I’m using PowerPoint because I don’t have an internet connection - so if you don’t have an internet connection, if you are travelling, if the page can’t be translated by Google Translate, or if you want to run data analysis on it, these are good reasons to scrape data off the web. I did data analysis on the talks: the most common words are “the” and “and” - I don’t know what’s up with those words - and we can start to see Django “discusses, describes”, and there is a lot of “nice, useful”. Depending on how interested you are in natural language processing, there are a lot of useful APIs out there for stripping out things like these stop words and high-occurrence words. The average character length for titles is 35 characters, and 7.4 per cent of talks mention Cardiff.
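The counting side of that analysis needs nothing beyond the standard library - a sketch with a tiny invented stop-word list, standing in for the scraped titles and descriptions from above:

    import re
    from collections import Counter

    # Stand-in data; in practice this would come from the scraper above.
    descriptions = ['Django discusses useful caching strategies',
                    'A nice talk describing deployment in Cardiff']
    titles = ['Data wrangling with Python', 'Testing Django apps']

    stop_words = {'the', 'and', 'a', 'of', 'to', 'in'}   # tiny example list
    words = re.findall(r"[a-z']+", ' '.join(descriptions).lower())
    counts = Counter(w for w in words if w not in stop_words)
    print(counts.most_common(10))

    # Summary statistics of the same flavour as the slide.
    avg_title_len = sum(len(t) for t in titles) / len(titles)
    pct_cardiff = 100.0 * sum('cardiff' in d.lower()
                              for d in descriptions) / len(descriptions)
    print(avg_title_len, pct_cardiff)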

So, another thing that Python supports is “big data” - I don’t really know many people who truly require big data, but if you do, you can integrate it with Hadoop. pandas and numpy are some of my best {inaudible}, and for anyone not familiar with them it’s great to get Wes McKinney’s book and start working through some of the examples. Using pandas you can automate reporting: it allows you to generate reports on the fly, or to create normalised generated reports that run every week, and hopefully move some of that front work off your task list. Then you can run statistical functions, generate graphs and charts, find and remove outliers if you need to (or find new outliers), normalise data and perform data clean-up. There are quite a lot of clean-up libraries; if you haven’t used it before, I highly recommend taking a look at FuzzyWuzzy - it’s one of my favourites for when you have to do some language processing and the data is not always the cleanest.
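As a sketch of that workflow - the column names, the CSV export and the three-standard-deviation cut-off are illustrative choices, not a fixed recipe:

    import pandas as pd
    from fuzzywuzzy import process   # fuzzy string matching

    df = pd.read_csv('site_visits.csv')          # hypothetical export

    # Drop rows more than three standard deviations from the mean duration.
    durations = df['duration']
    df = df[(durations - durations.mean()).abs() <= 3 * durations.std()]

    # A small weekly report, written out for the Excel crowd.
    report = df.groupby('channel')['duration'].agg(['mean', 'median', 'count'])
    report.to_excel('weekly_report.xlsx')        # needs an Excel writer installed

    # FuzzyWuzzy: match messy strings against a known list.
    cities = ['Cardiff', 'Swansea', 'Newport']
    print(process.extractOne('cardif bay', cities))  # -> ('Cardiff', score)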

OK, and then of course visualization, which is the point: you’ve pulled all of these things together - the data that you need - you’ve done some statistical analysis, and now you’re moving on to visualization. Python has great tools here: Bokeh is one I have been playing with, matplotlib is the standard, and pygal has some pretty cool SVG-related ones. And then there is the ability to easily share code and charts with IPython notebooks. At a client I work with currently there are a lot of non-technical people at the company, but they can easily run the code that generates the reports they need and download the Excel document directly from the notebook. I recommend it for teams that need to interface with teams that may be a little scared to code Python; I’ve found that over time they become a little more willing to play around with it - if you just change this variable, we can run it for a different data set - and that’s been kind of exciting.

This is Bokeh, if you haven’t played with it - a pretty cool visualization straight from the gallery.
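Producing a standalone interactive chart like that takes only a few lines - a sketch with made-up numbers:

    from bokeh.plotting import figure, output_file, show

    # Invented data: visits per day over a week.
    days = [1, 2, 3, 4, 5, 6, 7]
    visits = [120, 95, 140, 180, 160, 210, 175]

    p = figure(title='Site visits per day',
               x_axis_label='day', y_axis_label='visits')
    p.line(days, visits, line_width=2)

    output_file('visits.html')   # writes a standalone HTML page
    show(p)                      # opens it in the browser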

So if you want to know more, I know this one lady who writes about Python - I have a book coming out with another one of the folks I worked with at the Washington Post. It’s an O’Reilly book, and yes, if you get it today there is a free pint of Brains in it for you!

OK, so you can ask me questions now or you can ask me questions later. I will be here, sadly, only until Wednesday, when I have to get back because I have an intensive German course in Berlin, but yeah, feel free to reach out, and thanks so much for listening.

{Applause}.

DANIELE PROCIDA: Thank you very much. So, do we have any questions from our audience?

NEW SPEAKER: What is the hardest data-processing problem you’ve faced?

KATHARINE JARMUL: 100 per cent data clean-up. It’s the biggest pain - I don’t think anyone really likes it. There are quite a lot of powerful tools out there for it, but I find it is still the most manual of processes. If it’s clean data and I can import it from a database or a clean source, that’s great - I can immediately start using pandas or whatever I feel like using that day. But clean-up, particularly when there is no normalisation of the data - say, matching non-normalised strings, things like that - is just kind of one of those things. So maybe one day we’ll solve that problem; I don’t know how, but have a pint of Brains and let’s talk about it.

NEW SPEAKER: Do you have a favourite toolkit you use?

KATHARINE JARMUL: Yes - I mean NLTK, the standard toolkit, is the one I’ve played around with most, but I think there will probably eventually be something in between having that entire stack and having to learn so much about NLTK, something allowing people to use some of those tools within just a small library. I think FuzzyWuzzy is useful for that sort of text analysis.

NEW SPEAKER: Do you know of any way to get data from, say, film files?

KATHARINE JARMUL: From film files?

NEW SPEAKER: Yes - it’s made of audio and then many frames -

KATHARINE JARMUL: Does anybody have any ideas? I haven’t worked with film before.

NEW SPEAKER: FF {inaudible}.

KATHARINE JARMUL: ?? ... Yeah?

NEW SPEAKER: ... Testing your analysis against a known data set? Making sure that running the same {inaudible} reproduces the desired output always.

KATHARINE JARMUL: That’s pretty essential. I think one of the problems you run into with this, and some of why it can’t always be tested, is that you have to take into account, say, your handling of standard deviations or outliers, and whether you have a normalised or non-normalised data set. One of the hardest things is identifying whether your data set is normally distributed or not, and maybe taking different paths depending on that. So I think testing your data a little bit first and getting to know it is an essential first step before you decide, OK, this is the report I can use with it. If not, you are going to find your reports become really skewed because of one particular outlier or a few outliers. I think that’s essential and needs to be done more often, and determining different pathways depending on the data distribution is another key part of that. Thanks so much. {Applause}.
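That distribution check is easy to script - a sketch using scipy’s D’Agostino-Pearson normality test on invented data, with 0.05 as an arbitrary example threshold:

    import numpy as np
    from scipy import stats

    data = np.random.normal(loc=50, scale=5, size=500)  # stand-in sample

    # Null hypothesis: the sample comes from a normal distribution.
    statistic, p_value = stats.normaltest(data)

    if p_value > 0.05:
        print('No evidence against normality; parametric stats are reasonable.')
    else:
        print('Data looks non-normal; consider medians, ranks or transforms.')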

DANIELE PROCIDA: Thank you.

OK, a little announcement - two little announcements. Whatever you have heard from our website, or written down in your programme booklet or in a hand-out or anything else, I am telling you now the correct time for dinner tonight. Ignore anything else: aim to be at whichever venue it is - if you have a ticket for the Vegetarian Food Studio or the Clink - at 7:15. From here you can amble to either of those destinations in about 20 minutes. The aim is to start eating by 7:30, so if we aim to be there by 7:15 that will be helpful. If there are sponsors who would like to get their stuff moved to City Hall ready for tomorrow, we will put it in the van and take it down to City Hall, which is a short way away.

ADRIENNE LOWE: If you purchased the Hello Web App book, Tracy Osborne’s book, it will be delivered to the registration desk.

FROM THE FLOOR: When does the ...

DANIELE PROCIDA: We will put a board up for you to sign tomorrow. There was another thing to say; it has slipped my mind. I am sure it was really important.

Oh - if anybody has a spare pair of hands to help put any of the conference stuff into the back of the van when the talks are over, that would be very handy.

Ah yes - you should already have your tickets; you should have a printed-out ticket. Don’t worry if you don’t have the printed-out ticket, but you should either have purchased a ticket or had a ticket from us in one form or another. If that is not the case, and you expected to be at one of the restaurants, see me up in my quiet-room office thing. You can still buy tickets for the VFS - they are £15 for a good vegetarian meal. All the Django Girls ... you will have a chance to go to the different restaurants.

So, yes?