YAML for data split

less than 1 minute read


Why do we need YAML for data split?

Normally I used a fixed random seed to split dataset. However, I found that my dataset has been slightly and gradually updated with the development of the project.

For instance, in the latter experiments, I found that some patients should be excluded. Then should I re-train all the previous expreiments again? if not, how to ensure the following experiments use the same training/validation/testing data with the previous experiment? (The same seed for different lenth of patient list will lead to very different data split)

So let’s use a YAML file to split dataset so that we can always have the almost same split for training/validation/testing data.

A complete YAML tutorial could be found at Real Python

Difference between YAML, JSON and XML is here

Leave a Comment