
ARIA Spotlight: Scarlett Xu

Scarlett Xu's ARIA project: A Computational Model of Phonotactics

One mechanism for linguistic creativity is the generation of new morphemes via processes such as borrowing (e.g., sudoku), blending (e.g., spork), the pronunciation of acronyms (e.g., laser), and invention from whole cloth (e.g., derp). The adoption of such morphemes is governed by systematic intuitions about which sequences of sounds are likely and unlikely in a language. The linguistic study of these intuitions is known as phonotactics. In this project, I worked with Professor Timothy O’Donnell to explore and implement the major approaches to computational modelling of phonotactic structure. The focal point of the project is the comparison of three classes of models of phonotactics: Maximum Entropy models, Bayesian models, and neural network-based models. Given the scores from all three models, we perform statistical analysis to help us better decide which words all three models are most likely to discriminate against. The main objective of the experiment is to collect data that will allow us to compare multiple automatic phonotactic scorers. A good phonotactic scorer should reflect how humans judge the acceptability of a fake word. Thus, the results of the experiment will expose the strengths and weaknesses of the various automatic scorers.

The main reason I am interested in this project is that, while taking LING445 and LING483 in my program, I was exposed to various models that capture linguistic phenomena computationally and mathematically. It is interesting and exciting to see natural language being modelled with probability theory and the theory of computation, and to see large amounts of language data being used to train machine learning models. With proper modelling of real English words, we are able to quantify the acceptability of nonce English words using the three models.

During the project, I gained a concrete understanding of the relevant modelling and phonotactic concepts. Moreover, I acquired the technical skills needed to conduct statistical data analysis and to implement models in these classes. In particular, the project let me practise my programming skills in Python and R and consolidated my understanding of probability theory.

The biggest challenge, which took up most of the research time, was processing all of the provided data with Python. In particular, in order to compute the representativeness of the three scores, I needed to collect and aggregate the scores of the three models for each word from separate sheets to complete the database for each model. Moreover, words that were missing a result from one of the models were spread across the database and had to be removed. I therefore learned to use Python libraries for processing CSV files, such as pandas, and relevant Excel operations such as VLOOKUP. In addition, we encountered an encoding issue with the IPA symbols, which caused them to display incorrectly on different devices. I researched the problem and fixed it by making every read and write in my data-processing code use an encoding compatible with IPA.
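As a rough illustration of the aggregation step described above, here is a minimal sketch and not the actual project code. It uses only Python's standard `csv` module rather than pandas, and the column names (`word`, `score`), the model labels, the helper function names, and the toy words and scores are all assumptions made up for the example. It keeps only the words that every model has scored and writes the result as UTF-8 so that IPA symbols survive the round trip.

```python
import csv
import io


def read_scores(text):
    """Parse CSV text with columns `word,score` into a {word: score} dict.

    Rows with an empty score are skipped, i.e. treated as missing.
    """
    reader = csv.DictReader(io.StringIO(text))
    return {row["word"]: float(row["score"]) for row in reader if row["score"] != ""}


def merge_scores(tables):
    """Keep only words scored by every model; return one dict per word."""
    names = list(tables)
    shared = set.intersection(*(set(tables[name]) for name in names))
    return [
        {"word": word, **{name: tables[name][word] for name in names}}
        for word in sorted(shared)
    ]


def write_merged(path, rows, names):
    """Write merged rows to a CSV file, explicitly encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["word", *names])
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical toy data: IPA transcriptions with invented scores.
maxent_csv = "word,score\nblɪk,-3.2\nbnɪk,-9.1\nθɹɪp,-4.0\n"
bayes_csv = "word,score\nblɪk,-2.8\nbnɪk,-8.5\n"
neural_csv = "word,score\nblɪk,-3.0\nbnɪk,\nθɹɪp,-4.4\n"

tables = {
    "maxent": read_scores(maxent_csv),
    "bayes": read_scores(bayes_csv),
    "neural": read_scores(neural_csv),
}
merged = merge_scores(tables)
```

In this toy data only one word has a score from all three models, so `merged` contains a single row. Passing `encoding="utf-8"` explicitly on every read and write avoids relying on the platform default, which is one common way IPA characters end up displaying differently across devices.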

ARIA was an excellent opportunity for me to immerse myself in computational phonotactics and to conduct both theoretical and practical research. Not only did I learn a great deal about phonotactics, but I also experienced the real life of academic research. Although ARIA lasted only 11 weeks, the project has not officially ended in our lab. In my final year, I will therefore continue working on it and centre my honours thesis around it. The next step will involve selecting representative stimuli and building a behavioural experiment to collect human judgements on Amazon Mechanical Turk with the jsPsych library. Moreover, my interest in scientific research will support my choice of graduate studies over going into industry after my undergraduate degree.

Finally, I would like to thank the donors of the Arts Student Employment Fund and my supervisor for contributing to this award financially. I would like to thank my supervisor, Timothy J. O’Donnell, for giving me the opportunity to be a part of this project. His expert advice and his support for me and the other members of the project were instrumental. I feel honoured to have had the chance to study under him. I would also like to express my appreciation to Vanna Willerton for her assistance on this project.
