Thomas Samuel ('Sam') Spilsbury
Doctoral Student, Aalto University
I'm Sam. I'm an aspiring Computer Science Researcher. My main interests are natural language processing, natural signals processing and machine learning. What I'm most interested in, however, is figuring out how we can use computers to learn more about ourselves. Right now I'm looking at language and its connection to learning to perform tasks.
I am a doctoral student in Computer Science at Aalto University. I completed my Master's degree at Aalto University in 2020 and my Bachelor of Science at the University of Western Australia in 2016. I also hold a Bachelor of Laws and a Bachelor of Arts (Communication Studies) from the same university, completed in 2015 and 2013 respectively.
I wrote my Master's thesis for Curious AI. The thesis topic was 'Neural Model Predictive Control for Industrial Processes'. The thesis shows an end-to-end example of gathering data from the Tennessee Eastman Simulator, fitting appropriate linear and neural models with an encoder-decoder architecture, and performing Model Predictive Control using those models. It discusses adversarial behaviour in gradient-based optimization and countermeasures in the form of input likelihood estimation with Denoising Autoencoders. It also derives objective functions to ensure good performance on the tasks specified in the original Downs & Vogel paper.
A thesis is not mandatory for a Bachelor's degree in Australia. My main degree project for my Bachelor of Science used graph databases and web scraping to make sense of retractions on PubMed, and my capstone paper for the Bachelor of Arts was on the factors that motivate contributors in successful open source projects.
Research Experience Projects
Ecole Polytechnique Federale de Lausanne
Infogalactic is a fork of Wikipedia. This presents an interesting dataset: is there anything we can see about how a community behaves after a fork when maintaining a large-scale resource like Wikipedia? We find that community behaviour can be modelled primarily by topic interest and the underlying graph structure of the existing corpus.
Writing style imitation via combinatorial paraphrasing
Secure Systems Group
Tommi Gröndahl, N. Asokan
Recent advances in stylometry have shown that it is possible to identify the author of anonymous writing given a sufficient corpus of writing from that author. This work examines a mechanism to defeat stylometry by running a local instance of the stylometry process in reverse: generating semantically equivalent paraphrases and changes to text, then optimizing paraphrase selection using a much simpler surrogate model to mimic another author. We were able to defeat state-of-the-art classifiers while maintaining superior scores for subjective semantic retention according to NLG metrics and human evaluators.
Machine Learning for Big Data Group
Administrators of educational courses depend on feedback from students to determine areas for improvement or change within the course. However, making sense of this feedback once course attendance runs into the hundreds can become intractable. This project looked at ways to cluster unstructured text data by creating vector-space representations of its features. We came up with ways to generate clusters that human evaluators could annotate much more easily than a baseline of random clustering.
Thomas Spilsbury, Christabella Irwanto, Dimitrios Papatheodorou (2018)
With the advent of neural-network language models, there are now many ways to generate vector-space representations of words and symbols. There is interest in using these vector-space representations learned as a byproduct of training language models for downstream tasks. This work measures these representations against baselines generated by traditional methods to determine the areas in which language-model generated embeddings are better suited. Working Paper
Dropout for Fully Convolutional Image Segmentation Networks
Thomas Spilsbury, Paavo Camps (2018)
Comparing different dropout methods for fully convolutional image segmentation networks and measuring validation set performance with only 10% of training data available. Working Paper