Tutorial¶
“autolabeling” with k-Nearest Neighbors¶
This tutorial walks you through using hvc to automatically label Bengalese finch song with the k-Nearest Neighbors algorithm. We call this the “autolabel” workflow. (For more detail after going through this tutorial, please see autolabel in the “workflows: how to work with hvc” section of the How-To Guides.)
There are three main modules in hvc that you will use in the autolabel workflow: extract to extract features, select to select a model, and predict to predict labels for unlabeled data. The steps below walk you through each of them.
A convenient way to work through this tutorial is in IPython, so you might first start IPython from the command line, like this:
(my-hvc-environment) $ ipython
IPython is not installed automatically with hvc, so you’ll need to install it. If you’re using the conda package manager, this is as easy as:
(my-hvc-environment) $ conda install ipython
First you import the library so you can work with it.
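A minimal sketch of that import step. The availability check is added here for illustration and is not part of hvc; the helper name hvc_is_installed is hypothetical. It only reports whether the package can be found in the active environment before importing it.

```python
import importlib.util


def hvc_is_installed():
    """Return True if the hvc package can be found in the current environment."""
    return importlib.util.find_spec("hvc") is not None


if hvc_is_installed():
    import hvc  # gives access to hvc.extract, hvc.select, and hvc.predict
else:
    print("hvc not found; activate the environment where you installed it")
```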
0. Label a small set of songs to provide training data for the models, typically ~20-40 songs.¶
Here you would label your own song using your software of choice (evsonganaly, Sound Analysis Pro, Praat), but for this example you can download some hand-labeled data from a repository.
1. Pick a machine learning algorithm/model and the features used to train the model.¶
In this case we use the k-Nearest Neighbors (k-NN) algorithm. This algorithm is quick to apply to data but at least one empirical study shows that it does not give the best accuracy on Bengalese finch song. You’ll use the features built into the library that have been tested with k-NN. These features are based in part on those developed by the Troyer lab (http://www.utsa.edu/troyerlab/software.html).
You specify the models and features in a configuration file (“config” for short). More information about all the parameters in the config file can be found on the writing-extract-config page. For now you can just copy the text below and save it in a file. The config is written in YAML, a human-readable language for writing data structures such as lists and key-value mappings.
extract:
  spect_params:
    ref: evsonganaly
  segment_params:
    threshold: 1500 # arbitrary units of amplitude
    min_syl_dur: 0.01 # seconds (i.e., 10 ms)
    min_silent_dur: 0.006 # seconds (i.e., 6 ms)
  todo_list:
    -
      bird_ID : gy6or6
      file_format: evtaf
      feature_group:
        - knn
      data_dirs:
        - .\gy6or6\032612
      output_dir: .\gy6or6\
      labelset: iabcdefghjk
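To make the structure of this config concrete, here is a sketch of the nested Python data structure that a YAML parser would typically produce from it: mappings become dictionaries and dashed items become lists. The variable name extract_config is hypothetical, and hvc's own parser may validate or convert these values further.

```python
# Python equivalent of the YAML config above (illustrative only).
extract_config = {
    "extract": {
        "spect_params": {"ref": "evsonganaly"},
        "segment_params": {
            "threshold": 1500,
            "min_syl_dur": 0.01,
            "min_silent_dur": 0.006,
        },
        "todo_list": [
            {
                "bird_ID": "gy6or6",
                "file_format": "evtaf",
                "feature_group": ["knn"],
                # backslashes are doubled because this is a Python string
                "data_dirs": [".\\gy6or6\\032612"],
                "output_dir": ".\\gy6or6\\",
                "labelset": "iabcdefghjk",
            }
        ],
    }
}

# Each entry in todo_list describes one batch of song files to process.
for job in extract_config["extract"]["todo_list"]:
    print(job["bird_ID"], job["feature_group"], job["data_dirs"])
```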
2. Extract features for that model from song files that will be used to train the model.¶
You call the extract module and pass it the name of the YAML config file as an argument. In this example, the config file was saved as 'gy6or6_autolabel_example.knn.extract.config.yml'.
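A minimal sketch of that call, assuming hvc is installed and the config file sits in the current working directory. The wrapper run_extract and the up-front file check are hypothetical additions for illustration; only the call hvc.extract with the config file name comes from this tutorial.

```python
from pathlib import Path

EXTRACT_CONFIG = "gy6or6_autolabel_example.knn.extract.config.yml"


def run_extract(config_file=EXTRACT_CONFIG):
    """Call hvc.extract on a config file, failing early if the file is missing."""
    config_path = Path(config_file)
    if not config_path.is_file():
        raise FileNotFoundError(f"config file not found: {config_path}")
    import hvc  # imported here so the path check runs even without hvc installed
    hvc.extract(str(config_path))
```

In an IPython session you would then run run_extract(), or simply call hvc.extract with the file name directly.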
3. Pick the hyperparameters used by the algorithm as it trains the model on the data.¶
Now we use a convenience function (hvc.utils.find_best_k) to estimate which hyperparameter values will give the best accuracy when we train our machine learning models. The k-Nearest Neighbors algorithm has one main hyperparameter: the number of neighbors \(k\) in feature space that we look at to determine the label for a new syllable we are trying to classify.
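To illustrate what \(k\) controls, here is a minimal pure-Python sketch of k-NN classification on made-up two-dimensional features. It is not how hvc implements the classifier; the function knn_predict and the toy data are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_features, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    ]
    # keep the labels of the k closest points
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # the most common label among the neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two made-up features per syllable (hypothetical values)
train_features = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
                  (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_features, train_labels, query=(1.1, 1.0), k=3))
print(knn_predict(train_features, train_labels, query=(4.8, 5.1), k=3))
```

A small \(k\) makes the prediction sensitive to individual noisy neighbors, while a large \(k\) can blur the boundaries between syllable classes, which is why it is worth estimating a good value from the data.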
4. Train, i.e., fit the model to the data¶
5. Select the best model based on some measure of accuracy.¶
Again we use a config file. In the config file, we specify the name of the feature file saved by hvc.extract. Again you can just copy and paste the text below.
The key things to modify here are the hyperparameter \(k\) and the name of the feature file. Choose the value for \(k\) based on your results from running hvc.utils.find_best_k. You will get the name of the feature file from the directory created when you ran hvc.extract. The name of the directory will be something like extract_output_171031_214453. On the line that says feature_file:, paste the path to the feature file after the colon. The file name will have a format like summary_feature_file_created_171031_214642.
select:
  num_replicates: 10
  num_train_samples:
    start : 50
    stop : 250
    step : 50
  num_test_samples: 500
  models:
    -
      model_name: knn
      feature_list_indices: [0,1,2,3,4,5,6,7,8]
      hyperparameters:
        k : 4
  todo_list:
    - #1
      feature_file: .\gy6or6\extract_output_171031_214453\summary_feature_file_created_171031_214642
      output_dir: .\gy6or6\
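To fill in feature_file without hunting for the path by hand, a small helper along these lines may be convenient. It is a sketch that assumes the naming pattern seen in the example path above (a directory starting with extract_output_ that contains a file starting with summary_feature_file_created_); the function name find_latest_feature_file is hypothetical and not part of hvc.

```python
from pathlib import Path


def find_latest_feature_file(output_dir):
    """Return the last matching feature file in sorted order, or None.

    Sorted order matches chronological order only if the names end in a
    timestamp such as 171031_214453, as in the example config.
    """
    candidates = sorted(
        Path(output_dir).glob("extract_output_*/summary_feature_file_created_*")
    )
    return candidates[-1] if candidates else None
```

Calling find_latest_feature_file on the output_dir you gave to the extract config returns a path you can paste after feature_file:.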
Now you can use hvc.select to select the best model. hvc.select takes the name of the config file as an argument, which in this example is gy6or6_autolabel.example.select.knn.config.yml.
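To get a feel for how many models the select config asks for, the sketch below enumerates the training runs implied by num_replicates and num_train_samples. It assumes the stop value is exclusive, as in Python's range, and that replicates are numbered from zero; both are assumptions, although they are consistent with the example model file knn_200samples_replicate9.meta used in the predict step.

```python
num_replicates = 10
start, stop, step = 50, 250, 50

# Assumption: stop is exclusive, as in Python's range()
train_sizes = list(range(start, stop, step))

# One model is fit per (training set size, replicate) pair
runs = [(n, rep) for n in train_sizes for rep in range(num_replicates)]

print(train_sizes)
print(len(runs))
```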
6. Using the fit model, predict labels for unlabeled data.¶
Here you also use a config file.
The key thing to modify here is the model_meta_file parameter. hvc.select will also have created a directory, and for each model it fit, it will have saved two files: a .model file and a .meta file. The .meta file contains all the metadata that hvc needs to be able to use the .model file. Choose whichever .meta file gave you the best results according to the metric you’re using, e.g., the default of average accuracy across syllable classes. You also need to specify the directories with unlabeled data, under the data_dirs section.
predict:
  todo_list:
    -
      bird_ID : gy6or6
      file_format: evtaf
      data_dirs:
        - C:\Users\Seymour Snyder\Documents\example_song\032612
      model_meta_file: .\gy6or6\select_output_171031_215004\knn_k4\knn_200samples_replicate9.meta
      output_dir: .\gy6or6
      predict_proba: True
      convert: notmat
- In a text editor, open the predict config file.
- On the line that says model_meta_file:, after the colon, paste the name of a .meta file from the select output. The name will have a format like knn_200samples_replicate9.meta.
- Below the line that says data_dirs:, after the dash, add the path to the other folder of data that you downloaded.
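To see which .meta files are available to paste into the config, a helper like the following can list them. This is a sketch, not part of hvc: the function name list_meta_files is hypothetical, and it assumes each .model file shares its name stem with the matching .meta file. It only lists candidates; which one performed best still has to be read from the results of hvc.select.

```python
from pathlib import Path


def list_meta_files(select_output_dir):
    """Return (meta_path, has_model) pairs for every .meta file found."""
    pairs = []
    for meta_path in sorted(Path(select_output_dir).rglob("*.meta")):
        # Assumption: the matching .model file shares the same name stem
        model_path = meta_path.with_suffix(".model")
        pairs.append((meta_path, model_path.is_file()))
    return pairs
```

For example, list_meta_files on the select output directory would return one pair per fitted model, and any pair whose second element is False points to a .meta file without its .model file.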
Lastly, you use the hvc.predict module to predict labels for new syllables. hvc.predict also takes a config file name as an argument. In this example the file name is gy6or6_autolabel.example.knn.predict.config.yml.
parsed predict config
Changing to data directory: C:/Data/gy6or6_all_files/032612
Processing audio file 1 of 39.
Processing audio file 2 of 39.
...
Processing audio file 39 of 39.
predicting labels for features in file: features_from_032612_created_171206_013759
converting to .not.mat files
Congratulations! You have auto-labeled an entire day’s worth of data in just a few minutes!