“You might as well apply; if worst comes to worst, they say no…” is what I told myself over two months ago when I was heading into a meeting to discuss the possibility of an internship at Silverpond. Many hours of coding (and even longer debugging sessions) later, I am scheduling handover meetings and having a hard time saying goodbye to one of the most welcoming offices I have worked in.
1. Too much data
The project that I picked out (from a whiteboard full of options) was to help improve the existing identification of poachers in the WPS project and, should that be successful, to make my code available to the rest of the team.
Now, “improving a model” is a very vague goal, and there is no single solution to it. Noon had his sights set on increasing the amount of data used in training runs by drawing on publicly available data, an approach that could benefit any project, especially when the data provided by clients is not sufficient. The project was given the optimistic title of “Data Augmentation”.
As my first action, I proceeded to crash the whole floor's internet, dashing all hopes I had of not making a complete fool of myself on my first day. It took weeks for the full Google dataset to download, and I had to do most of my testing on a smaller subset, as the full set would otherwise freeze my local machine.
My finished code takes as input all the classes you want images for, searches the Google image dataset for all the images annotated with those labels, and returns a list of their IDs. What we did not anticipate was the sheer volume of annotated images we would be able to pull out for each label. Where we had expected to add a few hundred, maybe a thousand, images for each object we were hoping to identify, we got hundreds of thousands and sometimes over a million hits…
On the bright side, the code worked!!
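For the curious, the core of the filter can be sketched in a few lines of Python. This is only a minimal illustration, not the code I handed over: it assumes the annotations live in a CSV with one row per image and label, and the column names `ImageID` and `LabelName`, the function name `find_image_ids`, and the sample rows are all made up for the example.

```python
import csv
import io
from collections import defaultdict


def find_image_ids(label_rows, wanted_labels):
    """Return {label: [image IDs]} for every requested label that has hits.

    `label_rows` is any iterable of dicts with the (assumed) keys
    "ImageID" and "LabelName".
    """
    wanted = set(wanted_labels)
    hits = defaultdict(list)
    for row in label_rows:
        if row["LabelName"] in wanted:
            hits[row["LabelName"]].append(row["ImageID"])
    return dict(hits)


# Tiny made-up sample standing in for the real annotation file.
SAMPLE_CSV = """ImageID,LabelName
img001,person
img002,vehicle
img003,person
img004,cat
"""

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
result = find_image_ids(rows, ["person", "cat"])
print(result)
```

Because the rows are streamed one at a time, the whole annotation file never has to sit in memory at once, which matters when a single label can return over a million hits.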
2. Strings attached
Having gotten the Google image filter up and running, I was asked to jump in on a project extracting data from file types not ideally suited to storing numerical data, such as PDF and Word documents. This was the initial stage of a pet project to detect kidney disease in cats using ML. Something I learned fairly quickly is that the most lengthy part is almost always getting your data into a usable format. So far I had been spoiled by well cleaned and organised astronomical data.
As for the PDF and Word documents, the former can be processed with several freely available Python tools. The Word documents, however, were a different kettle of fish. Due to the included tables and other formatting quirks, they would not read in without causing the most interesting issues, such as creating a string that was over 2000 whitespace characters long. Converting the file into a PDF did not help. In desperation, I resorted to converting it into an image and trying some character recognition programs on it. That was an interesting challenge, but the document stayed stubborn and refused to relinquish its data to me (in an acceptable format).
Office cat Philipe took a personal interest in this project, possibly due to the large number of strings I was processing.
The finished project, designed by Philipe and myself, extracts key features and their corresponding numerical values from PDFs and stores them in a data frame. As we were still waiting on a dataset, we were limited to a few sample files from friends and had to leave the project there until the complete set was released to us.
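To give a flavour of the extraction step, here is a rough sketch that works on text already pulled out of a PDF. It is an assumption-heavy illustration rather than the actual project code: the “name: value” line layout, the pattern, the function name `extract_features`, and the sample report are all hypothetical, and the real files were a good deal messier.

```python
import re

# Assumed layout: a feature name, then ":" or "=", then a number.
# Anything after the number (units, comments) is ignored.
PAIR_PATTERN = re.compile(
    r"^[ \t]*([A-Za-z][A-Za-z ]*?)[ \t]*[:=][ \t]*(-?\d+(?:\.\d+)?)",
    re.MULTILINE,
)


def extract_features(text):
    """Return a list of {"feature": name, "value": number} rows."""
    rows = []
    for name, value in PAIR_PATTERN.findall(text):
        rows.append({"feature": name.strip(), "value": float(value)})
    return rows


# Made-up sample text; lines without a numeric value are skipped.
SAMPLE_TEXT = """Patient: Whiskers
Creatinine: 2.4 mg/dL
Urea = 18
Notes: drinking more than usual
"""

features = extract_features(SAMPLE_TEXT)
for row in features:
    print(row["feature"], row["value"])
```

A list of dictionaries like this can be passed straight to `pandas.DataFrame(features)` to get the data frame; the sketch stops short of that so it runs with the standard library alone.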
There is still some work left to be done by the rest of the team, to whom I am passing this code. Philipe moved on to a different project after developing an interest in robotics.
3. Ticking boxes
Something that unintentionally fell by the wayside was making the Google image filter easily accessible. While it was stored on a server where anyone could find it, I was still the only person who knew how to run it successfully. This would not do! The most convenient way for everyone to access and filter Google images would be through the existing Silverbrane interface.
Even though it took a lot longer than I expected, the front end is cleaned up and ready for use. The last thing for me to do is link it up with my existing Python code that runs the image filter, after which my code could potentially help a large number of projects for many years to come.
This last section of my project has given me a much greater appreciation of the various websites, and their functionality, that I have been taking for granted for such a long time.