Clark: Classifying Bird Vocalizations

Clark, M.L., Salas, L., Baligar, S., Quinn, C., Snyder, R.L., Leland, D., Schackwitz, W., Goetz, S.J., Newsam, S. (2023). The effect of soundscape composition on bird vocalization classification in a citizen science biodiversity monitoring project. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2023.102065

There is a need to monitor biodiversity at multiple spatial and temporal scales to aid conservation efforts. Autonomous recording units (ARUs) can provide cost-effective, long-term, and systematic species monitoring data for sound-producing wildlife, including birds, amphibians, insects, and mammals, over large areas. Modern deep learning can efficiently automate the detection of species occurrences in these sound data with high accuracy. Further, citizen science can be leveraged to scale up the deployment of ARUs and to collect the reference vocalizations needed for training and validating deep learning models. In this study we developed a convolutional neural network (CNN) acoustic classification pipeline for detecting 54 bird species in Sonoma County, California, USA, with sound and reference vocalization data collected by citizen scientists within the Soundscapes to Landscapes project (www.soundscapes2landscapes.org). We trained three ImageNet-based CNN architectures (MobileNetV2, ResNet50V2, ResNet101V2), which function as a Mixture of Experts (MoE), to evaluate several methods for enhancing model accuracy. Specifically, we: 1) quantified accuracy with fully-labeled 1-min soundscapes for an assessment of real-world conditions; 2) assessed the effect on precision and recall of additional pre-training with an external sound archive (xeno-canto) prior to fine-tuning with vocalization data from our study domain; and 3) assessed how detections and errors are influenced by the presence of coincident biotic and non-biotic sounds (i.e., soundscape components).

In evaluating accuracy with soundscape data (n = 37 species) across CNN probability thresholds and models, we found that acoustic pre-training followed by fine-tuning improved average precision by 10.3% relative to no pre-training, although recall decreased by an average of 0.8%. In selecting an optimal CNN architecture for each species based on maximum F(β = 0.5), we found that our MoE approach had a total precision of 84.5% and an average species precision of 85.1%.

Our data exhibited multiple issues arising from applying citizen science and acoustic monitoring at the county scale, including deployment of ARUs with relatively low fidelity and recordings with background noise and overlapping vocalizations. In particular, human noise was significantly associated with more incorrect species detections (false positives, decreased precision), while physical interference (e.g., the recorder being hit by a branch) and geophony (e.g., wind) were associated with the classifier missing detections (false negatives, decreased recall). Our process surmounted these obstacles, and our final predictions demonstrate how deep learning applied to acoustic data from low-cost ARUs, paired with citizen science, can provide valuable bird diversity data for monitoring and conservation efforts.
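The two-stage training recipe described above (ImageNet initialization, acoustic pre-training on xeno-canto, then fine-tuning on study-domain vocalizations) can be sketched in a few lines of Keras. This is a minimal illustration, not the paper's actual configuration: the input shape, optimizer settings, epoch counts, and the dummy_dataset() stand-ins are all assumptions, and a real pipeline would feed spectrograms derived from the 1-min recordings.

    import tensorflow as tf

    NUM_SPECIES = 54
    INPUT_SHAPE = (224, 224, 3)  # spectrogram image size; an assumption, not from the paper

    def build_classifier(arch=tf.keras.applications.MobileNetV2):
        # ImageNet-initialized backbone, one of the paper's three CNN variants
        backbone = arch(include_top=False, weights="imagenet",
                        input_shape=INPUT_SHAPE, pooling="avg")
        # Multi-label head: each species gets an independent detection probability
        outputs = tf.keras.layers.Dense(NUM_SPECIES, activation="sigmoid")(backbone.output)
        return tf.keras.Model(backbone.input, outputs)

    def dummy_dataset(n=64):
        # Placeholder for a real spectrogram loader (xeno-canto or Sonoma County clips)
        x = tf.random.uniform((n,) + INPUT_SHAPE)
        y = tf.cast(tf.random.uniform((n, NUM_SPECIES)) > 0.9, tf.float32)
        return tf.data.Dataset.from_tensor_slices((x, y)).batch(8)

    xeno_canto_ds, sonoma_ds = dummy_dataset(), dummy_dataset()

    model = build_classifier()

    # Stage 1: acoustic pre-training on the external xeno-canto archive
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="binary_crossentropy")
    model.fit(xeno_canto_ds, epochs=5)

    # Stage 2: fine-tune on study-domain reference vocalizations at a lower learning rate
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
    model.fit(sonoma_ds, epochs=5)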
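The per-species selection criterion, maximum F(β = 0.5), is the standard weighted harmonic mean of precision P and recall R; β = 0.5 weights precision more heavily than recall, consistent with the emphasis on limiting false positives:

    F_{\beta} = \frac{(1 + \beta^{2})\, P R}{\beta^{2} P + R},
    \qquad
    F_{0.5} = \frac{1.25\, P R}{0.25\, P + R}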
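A minimal sketch of the Mixture-of-Experts selection step, assuming per-clip detection probabilities from each architecture and binary labels from the fully-labeled soundscapes; the threshold grid, function names, and synthetic data below are hypothetical, not taken from the paper:

    import numpy as np

    def f_beta(tp, fp, fn, beta=0.5):
        # F-beta from raw counts; returns 0 when the score is undefined
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    def select_expert(probs_by_model, labels, thresholds=np.linspace(0.05, 0.95, 19)):
        # probs_by_model: {architecture name: (n_clips,) probabilities} for one species
        # labels: (n_clips,) binary ground truth for that species
        best = (None, None, -1.0)
        for arch, probs in probs_by_model.items():
            for t in thresholds:
                pred = probs >= t
                tp = int(np.sum(pred & (labels == 1)))
                fp = int(np.sum(pred & (labels == 0)))
                fn = int(np.sum(~pred & (labels == 1)))
                score = f_beta(tp, fp, fn)
                if score > best[2]:
                    best = (arch, t, score)
        return best  # (architecture, threshold, F0.5) used as that species' expert

    # Example with synthetic data: pick the expert for one species
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, 200)
    probs = {m: rng.random(200) for m in ("MobileNetV2", "ResNet50V2", "ResNet101V2")}
    print(select_expert(probs, labels))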