spec for yaml config files¶
This document specifies the structure of HVC config files written in yaml. It is a painfully dry document that exists to guide the project code, not to teach someone how to write HVC config files. For a gentle introduction to writing config files, please see the writing_config_files.
Essentially, each config file specifies a list of jobs. Each job in a list will typically correspond to data files from one bird.
Config files consist of three sections:
- global_config: parameters that apply to all jobs
- model_selection: list of jobs for selecting machine learning models
- prediction: list of jobs that apply models to unclassified data
## global_config As the name implies, parameters in the global_config section apply to all jobs. The global_config is a dictionary of dictionaries.
Example: ``` yaml global_config:
- spect_params :
- samp_freq : 32000 # Hz window_size : 512 window_step : 32 freq_cutoffs : [1000,8000]
- neural_net :
- syl_spect_width : 300
model_selection¶
- model_selection is a list of jobs. Each job is a dictionary.
- Hence model_selection is a list of dictionaries.
Each job, i.e. each item in the list, is marked with an empty dash. Below each empty dash appear the keys and values that make up the dictionary.
A job in the ‘model_selection` section must include the following keys:
bird_ID : string, alphanumeric, identifies bird
train : dictionary with parameters for training dataset
test : dictionary with parameters for testing dataset - both train and test contain a list dirs. Each item in dirs
is a string, and that string must be a path to a directory of audio files (expected to contain song from the bird bird_ID).
output_dir : string, directory where output will be saved. HVC
creates a new subfolder in the given directory. - labelset : string, labels used for syllabes. Only syllables with the labels in labelset will be included in the training and testing
datasets.
**If a parameter is defined in global_config and then defined again in a job, the value defined in the job takes precedence over the
global_config value, but only for that job.**
Example: ``` yaml model_selection: # list of dictionaries, dash without key next to is a list item so each dictionary is an item in the list
- # i.e. this is dictionary 1
bird_ID : gr41rd51
- train :
- dirs:
- C:DATAgr41rd51pre_surgery_baseline06-21-12
- test :
- dirs:
- C:DATAgr41rd51pre_surgery_baseline06-19-12
- C:DATAgr41rd51pre_surgery_baseline06-20-12
- C:DATAgr41rd51pre_surgery_baseline06-22-12
output_dir: C:DATAgr41rd51
labelset : iabcdefgjkm
- spect_params : # not required, but will take precedence over spect_params in global_config
samp_freq : 32000 # Hz window_size : 512 window_step : 32 freq_cutoffs : [1000,10000]
prediction¶
Like model_selection, the prediction section is a list of job dictionaries.
- A job in the ‘prediction` section must include the following keys:
- bird_ID : string, alphanumeric, identifies bird
- model_file : string, a file name. Either a scikit-learn model that
has been `pickle`d or `dump`ed by joblib, or an hdf5 model output by Keras.
``` yaml prediction:
- bird_ID : gr41rd51 model_file : gr41rd51_svm.pkl
parameters¶
- The parameters listed below can appear in either global_config or a job.
- spect_params :
- samp_freq : integer
- window_size : integer
- window_step : integer
- freq_cutoffs : list
- num_train_songs :
- start : integer
- stop : integer
- step : integer
- num_train_samples :
- start : integer
- stop : integer
- step : integer
- models :
- knn
- linsvm
- svm
- neural_net