how to use peoples speech dataset? #57

Open
housebaby opened this issue Dec 20, 2021 · 5 comments
Comments

@housebaby

housebaby commented Dec 20, 2021

I have downloaded the People's Speech dataset and have two questions:

  1. How do I parse the two files?
  2. I notice there are two options: clean / other.
    What is the relationship between the two options?
    Do I have to download the data for both options?
    Would the total audio then be 60k hours?
@xiaobobo-bilibili

I have the same problem. Apart from clean/dirty, there is also the difference between CC-BY and CC-BY-SA. I have downloaded all of these files but can't decompress them because they don't look like zip or tar files.
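
When a download has no recognizable extension, one way to check what it is would be to inspect the file contents rather than the name. The following is a minimal sketch using only the Python standard library; the function name `guess_archive_type` is hypothetical, and nothing here is specific to how the dataset is actually packaged.

```python
import tarfile
import zipfile


def guess_archive_type(path):
    """Return 'tar', 'zip', or 'unknown' based on file contents, not the name.

    Hypothetical helper for illustration only.
    """
    path = str(path)
    # is_tarfile also recognizes gzip/bzip2/xz-compressed tar archives.
    if tarfile.is_tarfile(path):
        return "tar"
    if zipfile.is_zipfile(path):
        return "zip"
    return "unknown"
```

If this returns "tar", the file can be opened with `tarfile.open(path)` even when its name does not end in `.tar`.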

@will-rice

will-rice commented Dec 23, 2021

If you extract what you downloaded from the "Data" button, you can use the manifest to build text/speech pairs based on the label and name keys under the training_data key. One thing I'm confused about is how to access the multilingual data. I don't see a language key in the manifest.
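
The pairing described above could be sketched roughly as follows. This assumes the manifest is a JSON Lines file (one JSON object per line) and that `name` and `label` under `training_data` are parallel lists; the thread only confirms the key names, so both points are assumptions, and `iter_pairs` is a hypothetical helper name.

```python
import json


def iter_pairs(manifest_path):
    """Yield (audio_name, transcript) pairs from a manifest file.

    Assumes one JSON object per line, each with a `training_data` key.
    """
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            training = json.loads(line)["training_data"]
            # Assumption: `name` and `label` are parallel lists, so the
            # i-th name is the audio clip for the i-th label (transcript).
            for name, label in zip(training["name"], training["label"]):
                yield name, label
```

For example, `pairs = list(iter_pairs("manifest.json"))` would collect every pair in memory; for a large manifest it may be better to consume the generator lazily.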

@housebaby
Author

I can get label and name from the manifest, but how can I get the WAVs from the big file downloaded from "Data"?

@BuaaAlban

BuaaAlban commented Feb 9, 2022

Hi,
I found 260,988 FLAC files in the released tar dataset (part-00000-07a8f0d3-6d27-4299-887a-dc12a6d72f8d-c000.tar), which amounts to only about 1,000 hours, but the JSON metadata file (part-00000-4e132642-c01c-4db6-9db0-a1e19193f6f8-c000.json) lists 4,321,002 files, so the two do not match.

Do you have any suggestions, or have you run into the same problem?
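
A mismatch like this can be quantified by comparing the audio members of the tar against the names listed in the JSON metadata. The sketch below assumes the JSON file has one object per line with a `training_data.name` list, and that those names match the tar member paths exactly; neither is confirmed in this thread, and the function names are hypothetical.

```python
import json
import tarfile


def manifest_names(manifest_path):
    """Collect every audio name listed in the manifest (assumed JSON Lines)."""
    names = set()
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                names.update(json.loads(line)["training_data"]["name"])
    return names


def tar_audio_names(tar_path, suffix=".flac"):
    """Collect the paths of regular files in the tar that end with `suffix`."""
    with tarfile.open(tar_path) as tar:
        return {m.name for m in tar if m.isfile() and m.name.endswith(suffix)}


def compare(manifest_path, tar_path):
    """Count names found only in the manifest, only in the tar, or in both."""
    in_manifest = manifest_names(manifest_path)
    in_tar = tar_audio_names(tar_path)
    return {
        "manifest_only": len(in_manifest - in_tar),
        "tar_only": len(in_tar - in_manifest),
        "both": len(in_manifest & in_tar),
    }
```

If `both` comes back as zero, the likelier explanation is that the assumed name format does not match the tar paths, rather than that the data is missing.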

@mozoltov183

Hello, could you share part-00000-4e132642-c01c-4db6-9db0-a1e19193f6f8-c000.json with me?
I tried to download it from https://mlcommons.org/en/peoples-speech/ but ran into some problems.
I have already downloaded part-00000-07a8f0d3-6d27-4299-887a-dc12a6d72f8d-c000.tar. Thank you very much!
