Created using Colab

capcomm 2024-04-23 12:20:02 -04:00
parent f71c4f38c3
commit 244671a330
1 changed files with 411 additions and 0 deletions

whisper_youtube.ipynb

@ -0,0 +1,411 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/capcomm/WordPress/blob/master/whisper_youtube.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# **Youtube Videos Transcription with OpenAI's Whisper**\n",
"\n",
"[![blog post shield](https://img.shields.io/static/v1?label=&message=Blog%20post&color=blue&style=for-the-badge&logo=openai&link=https://openai.com/blog/whisper)](https://openai.com/blog/whisper)\n",
"[![notebook shield](https://img.shields.io/static/v1?label=&message=Notebook&color=blue&style=for-the-badge&logo=googlecolab&link=https://colab.research.google.com/github/ArthurFDLR/whisper-youtube/blob/main/whisper_youtube.ipynb)](https://colab.research.google.com/github/ArthurFDLR/whisper-youtube/blob/main/whisper_youtube.ipynb)\n",
"[![repository shield](https://img.shields.io/static/v1?label=&message=Repository&color=blue&style=for-the-badge&logo=github&link=https://github.com/openai/whisper)](https://github.com/openai/whisper)\n",
"[![paper shield](https://img.shields.io/static/v1?label=&message=Paper&color=blue&style=for-the-badge&link=https://cdn.openai.com/papers/whisper.pdf)](https://cdn.openai.com/papers/whisper.pdf)\n",
"[![model card shield](https://img.shields.io/static/v1?label=&message=Model%20card&color=blue&style=for-the-badge&link=https://github.com/openai/whisper/blob/main/model-card.md)](https://github.com/openai/whisper/blob/main/model-card.md)\n",
"\n",
"Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.\n",
"\n",
"This Notebook will guide you through the transcription of a Youtube video using Whisper. You'll be able to explore most inference parameters or use the Notebook as-is to store the transcript and video audio in your Google Drive."
],
"metadata": {
"id": "96kvih9mXkNN"
}
},
{
"cell_type": "code",
"source": [
"#@markdown # **Check GPU type** 🕵️\n",
"\n",
"#@markdown The type of GPU you get assigned in your Colab session defined the speed at which the video will be transcribed.\n",
"#@markdown The higher the number of floating point operations per second (FLOPS), the faster the transcription.\n",
"#@markdown But even the least powerful GPU available in Colab is able to run any Whisper model.\n",
"#@markdown Make sure you've selected `GPU` as hardware accelerator for the Notebook (Runtime &rarr; Change runtime type &rarr; Hardware accelerator).\n",
"\n",
"#@markdown | GPU | GPU RAM | FP32 teraFLOPS | Availability |\n",
"#@markdown |:------:|:----------:|:--------------:|:------------------:|\n",
"#@markdown | T4 | 16 GB | 8.1 | Free |\n",
"#@markdown | P100 | 16 GB | 10.6 | Colab Pro |\n",
"#@markdown | V100 | 16 GB | 15.7 | Colab Pro (Rare) |\n",
"\n",
"#@markdown ---\n",
"#@markdown **Factory reset your Notebook's runtime if you want to get assigned a new GPU.**\n",
"\n",
"!nvidia-smi -L\n",
"\n",
"!nvidia-smi"
],
"metadata": {
"id": "QshUbLqpX7L4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IfG0E_WbRFI0",
"cellView": "form"
},
"outputs": [],
"source": [
"#@markdown # **Install libraries** 🏗️\n",
"#@markdown This cell will take a little while to download several libraries, including Whisper.\n",
"\n",
"#@markdown ---\n",
"\n",
"! pip install git+https://github.com/openai/whisper.git\n",
"! pip install yt-dlp\n",
"\n",
"import sys\n",
"import warnings\n",
"import whisper\n",
"from pathlib import Path\n",
"import yt_dlp\n",
"import subprocess\n",
"import torch\n",
"import shutil\n",
"import numpy as np\n",
"from IPython.display import display, Markdown, YouTubeVideo\n",
"\n",
"device = torch.device('cuda:0')\n",
"print('Using device:', device, file=sys.stderr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1zwGAsr4sIgd",
"cellView": "form"
},
"outputs": [],
"source": [
"#@markdown # **Optional:** Save data in Google Drive 💾\n",
"#@markdown Enter a Google Drive path and run this cell if you want to store the results inside Google Drive.\n",
"\n",
"# Uncomment to copy generated images to drive, faster than downloading directly from colab in my experience.\n",
"from google.colab import drive\n",
"drive_mount_path = Path(\"/\") / \"content\" / \"drive\"\n",
"drive.mount(str(drive_mount_path))\n",
"drive_mount_path /= \"My Drive\"\n",
"#@markdown ---\n",
"drive_path = \"Colab Notebooks/Whisper Youtube\" #@param {type:\"string\"}\n",
"#@markdown ---\n",
"#@markdown **Run this cell again if you change your Google Drive path.**\n",
"\n",
"drive_whisper_path = drive_mount_path / Path(drive_path.lstrip(\"/\"))\n",
"drive_whisper_path.mkdir(parents=True, exist_ok=True)"
]
},
{
"cell_type": "code",
"source": [
"#@markdown # **Model selection** 🧠\n",
"\n",
"#@markdown As of the first public release, there are 4 pre-trained options to play with:\n",
"\n",
"#@markdown | Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |\n",
"#@markdown |:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|\n",
"#@markdown | tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~32x |\n",
"#@markdown | base | 74 M | `base.en` | `base` | ~1 GB | ~16x |\n",
"#@markdown | small | 244 M | `small.en` | `small` | ~2 GB | ~6x |\n",
"#@markdown | medium | 769 M | `medium.en` | `medium` | ~5 GB | ~2x |\n",
"#@markdown | large | 1550 M | N/A | `large` | ~10 GB | 1x |\n",
"\n",
"#@markdown ---\n",
"Model = 'medium' #@param ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large']\n",
"#@markdown ---\n",
"#@markdown **Run this cell again if you change the model.**\n",
"\n",
"whisper_model = whisper.load_model(Model)\n",
"\n",
"if Model in whisper.available_models():\n",
" display(Markdown(\n",
" f\"**{Model} model is selected.**\"\n",
" ))\n",
"else:\n",
" display(Markdown(\n",
" f\"**{Model} model is no longer available.**<br /> Please select one of the following:<br /> - {'<br /> - '.join(whisper.available_models())}\"\n",
" ))"
],
"metadata": {
"cellView": "form",
"id": "TMhrSq_GZ6kA"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#@markdown # **Video selection** 📺\n",
"\n",
"#@markdown Enter the URL of the Youtube video you want to transcribe, wether you want to save the audio file in your Google Drive, and run the cell.\n",
"\n",
"Type = \"Youtube video or playlist\" #@param ['Youtube video or playlist', 'Google Drive']\n",
"#@markdown ---\n",
"#@markdown #### **Youtube video or playlist**\n",
"URL = \"https://youtu.be/L_Guz73e6fw\" #@param {type:\"string\"}\n",
"# store_audio = True #@param {type:\"boolean\"}\n",
"#@markdown ---\n",
"#@markdown #### **Google Drive video, audio (mp4, wav), or folder containing video and/or audio files**\n",
"video_path = \"Colab Notebooks/transcription/my_video.mp4\" #@param {type:\"string\"}\n",
"#@markdown ---\n",
"#@markdown **Run this cell again if you change the video.**\n",
"\n",
"video_path_local_list = []\n",
"\n",
"if Type == \"Youtube video or playlist\":\n",
"\n",
" ydl_opts = {\n",
" 'format': 'm4a/bestaudio/best',\n",
" 'outtmpl': '%(id)s.%(ext)s',\n",
" # See help(yt_dlp.postprocessor) for a list of available Postprocessors and their arguments\n",
" 'postprocessors': [{ # Extract audio using ffmpeg\n",
" 'key': 'FFmpegExtractAudio',\n",
" 'preferredcodec': 'wav',\n",
" }]\n",
" }\n",
"\n",
" with yt_dlp.YoutubeDL(ydl_opts) as ydl:\n",
" error_code = ydl.download([URL])\n",
" list_video_info = [ydl.extract_info(URL, download=False)]\n",
"\n",
" for video_info in list_video_info:\n",
" video_path_local_list.append(Path(f\"{video_info['id']}.wav\"))\n",
"\n",
"elif Type == \"Google Drive\":\n",
" # video_path_drive = drive_mount_path / Path(video_path.lstrip(\"/\"))\n",
" video_path = drive_mount_path / Path(video_path.lstrip(\"/\"))\n",
" if video_path.is_dir():\n",
" for video_path_drive in video_path.glob(\"**/*\"):\n",
" if video_path_drive.is_file():\n",
" display(Markdown(f\"**{str(video_path_drive)} selected for transcription.**\"))\n",
" elif video_path_drive.is_dir():\n",
" display(Markdown(f\"**Subfolders not supported.**\"))\n",
" else:\n",
" display(Markdown(f\"**{str(video_path_drive)} does not exist, skipping.**\"))\n",
" video_path_local = Path(\".\").resolve() / (video_path_drive.name)\n",
" shutil.copy(video_path_drive, video_path_local)\n",
" video_path_local_list.append(video_path_local)\n",
" elif video_path.is_file():\n",
" video_path_local = Path(\".\").resolve() / (video_path.name)\n",
" shutil.copy(video_path, video_path_local)\n",
" video_path_local_list.append(video_path_local)\n",
" display(Markdown(f\"**{str(video_path)} selected for transcription.**\"))\n",
" else:\n",
" display(Markdown(f\"**{str(video_path)} does not exist.**\"))\n",
"\n",
"else:\n",
" raise(TypeError(\"Please select supported input type.\"))\n",
"\n",
"for video_path_local in video_path_local_list:\n",
" if video_path_local.suffix == \".mp4\":\n",
" video_path_local = video_path_local.with_suffix(\".wav\")\n",
" result = subprocess.run([\"ffmpeg\", \"-i\", str(video_path_local.with_suffix(\".mp4\")), \"-vn\", \"-acodec\", \"pcm_s16le\", \"-ar\", \"16000\", \"-ac\", \"1\", str(video_path_local)])\n"
],
"metadata": {
"id": "xYLPZQX9S7tU",
"cellView": "form"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-X0qB9JAzMLY",
"cellView": "form",
"collapsed": true
},
"outputs": [],
"source": [
"#@markdown # **Run the model** 🚀\n",
"\n",
"#@markdown Run this cell to execute the transcription of the video. This can take a while and very based on the length of the video and the number of parameters of the model selected above.\n",
"\n",
"#@markdown ## **Parameters** ⚙️\n",
"\n",
"#@markdown ### **Behavior control**\n",
"#@markdown ---\n",
"language = \"English\" #@param ['Auto detection', 'Afrikaans', 'Albanian', 'Amharic', 'Arabic', 'Armenian', 'Assamese', 'Azerbaijani', 'Bashkir', 'Basque', 'Belarusian', 'Bengali', 'Bosnian', 'Breton', 'Bulgarian', 'Burmese', 'Castilian', 'Catalan', 'Chinese', 'Croatian', 'Czech', 'Danish', 'Dutch', 'English', 'Estonian', 'Faroese', 'Finnish', 'Flemish', 'French', 'Galician', 'Georgian', 'German', 'Greek', 'Gujarati', 'Haitian', 'Haitian Creole', 'Hausa', 'Hawaiian', 'Hebrew', 'Hindi', 'Hungarian', 'Icelandic', 'Indonesian', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kazakh', 'Khmer', 'Korean', 'Lao', 'Latin', 'Latvian', 'Letzeburgesch', 'Lingala', 'Lithuanian', 'Luxembourgish', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Maltese', 'Maori', 'Marathi', 'Moldavian', 'Moldovan', 'Mongolian', 'Myanmar', 'Nepali', 'Norwegian', 'Nynorsk', 'Occitan', 'Panjabi', 'Pashto', 'Persian', 'Polish', 'Portuguese', 'Punjabi', 'Pushto', 'Romanian', 'Russian', 'Sanskrit', 'Serbian', 'Shona', 'Sindhi', 'Sinhala', 'Sinhalese', 'Slovak', 'Slovenian', 'Somali', 'Spanish', 'Sundanese', 'Swahili', 'Swedish', 'Tagalog', 'Tajik', 'Tamil', 'Tatar', 'Telugu', 'Thai', 'Tibetan', 'Turkish', 'Turkmen', 'Ukrainian', 'Urdu', 'Uzbek', 'Valencian', 'Vietnamese', 'Welsh', 'Yiddish', 'Yoruba']\n",
"#@markdown > Language spoken in the audio, use `Auto detection` to let Whisper detect the language.\n",
"#@markdown ---\n",
"verbose = 'Live transcription' #@param ['Live transcription', 'Progress bar', 'None']\n",
"#@markdown > Whether to print out the progress and debug messages.\n",
"#@markdown ---\n",
"output_format = 'all' #@param ['txt', 'vtt', 'srt', 'tsv', 'json', 'all']\n",
"#@markdown > Type of file to generate to record the transcription.\n",
"#@markdown ---\n",
"task = 'transcribe' #@param ['transcribe', 'translate']\n",
"#@markdown > Whether to perform X->X speech recognition (`transcribe`) or X->English translation (`translate`).\n",
"#@markdown ---\n",
"\n",
"#@markdown <br/>\n",
"\n",
"#@markdown ### **Optional: Fine tunning**\n",
"#@markdown ---\n",
"temperature = 0.15 #@param {type:\"slider\", min:0, max:1, step:0.05}\n",
"#@markdown > Temperature to use for sampling.\n",
"#@markdown ---\n",
"temperature_increment_on_fallback = 0.2 #@param {type:\"slider\", min:0, max:1, step:0.05}\n",
"#@markdown > Temperature to increase when falling back when the decoding fails to meet either of the thresholds below.\n",
"#@markdown ---\n",
"best_of = 5 #@param {type:\"integer\"}\n",
"#@markdown > Number of candidates when sampling with non-zero temperature.\n",
"#@markdown ---\n",
"beam_size = 8 #@param {type:\"integer\"}\n",
"#@markdown > Number of beams in beam search, only applicable when temperature is zero.\n",
"#@markdown ---\n",
"patience = 1.0 #@param {type:\"number\"}\n",
"#@markdown > Optional patience value to use in beam decoding, as in [*Beam Decoding with Controlled Patience*](https://arxiv.org/abs/2204.05424), the default (1.0) is equivalent to conventional beam search.\n",
"#@markdown ---\n",
"length_penalty = -0.05 #@param {type:\"slider\", min:-0.05, max:1, step:0.05}\n",
"#@markdown > Optional token length penalty coefficient (alpha) as in [*Google's Neural Machine Translation System*](https://arxiv.org/abs/1609.08144), set to negative value to uses simple length normalization.\n",
"#@markdown ---\n",
"suppress_tokens = \"-1\" #@param {type:\"string\"}\n",
"#@markdown > Comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations.\n",
"#@markdown ---\n",
"initial_prompt = \"\" #@param {type:\"string\"}\n",
"#@markdown > Optional text to provide as a prompt for the first window.\n",
"#@markdown ---\n",
"condition_on_previous_text = True #@param {type:\"boolean\"}\n",
"#@markdown > if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop.\n",
"#@markdown ---\n",
"fp16 = True #@param {type:\"boolean\"}\n",
"#@markdown > whether to perform inference in fp16.\n",
"#@markdown ---\n",
"compression_ratio_threshold = 2.4 #@param {type:\"number\"}\n",
"#@markdown > If the gzip compression ratio is higher than this value, treat the decoding as failed.\n",
"#@markdown ---\n",
"logprob_threshold = -1.0 #@param {type:\"number\"}\n",
"#@markdown > If the average log probability is lower than this value, treat the decoding as failed.\n",
"#@markdown ---\n",
"no_speech_threshold = 0.6 #@param {type:\"slider\", min:-0.0, max:1, step:0.05}\n",
"#@markdown > If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence.\n",
"#@markdown ---\n",
"\n",
"verbose_lut = {\n",
" 'Live transcription': True,\n",
" 'Progress bar': False,\n",
" 'None': None\n",
"}\n",
"\n",
"args = dict(\n",
" language = (None if language == \"Auto detection\" else language),\n",
" verbose = verbose_lut[verbose],\n",
" task = task,\n",
" temperature = temperature,\n",
" temperature_increment_on_fallback = temperature_increment_on_fallback,\n",
" best_of = best_of,\n",
" beam_size = beam_size,\n",
" patience=patience,\n",
" length_penalty=(length_penalty if length_penalty>=0.0 else None),\n",
" suppress_tokens=suppress_tokens,\n",
" initial_prompt=(None if not initial_prompt else initial_prompt),\n",
" condition_on_previous_text=condition_on_previous_text,\n",
" fp16=fp16,\n",
" compression_ratio_threshold=compression_ratio_threshold,\n",
" logprob_threshold=logprob_threshold,\n",
" no_speech_threshold=no_speech_threshold\n",
")\n",
"\n",
"temperature = args.pop(\"temperature\")\n",
"temperature_increment_on_fallback = args.pop(\"temperature_increment_on_fallback\")\n",
"if temperature_increment_on_fallback is not None:\n",
" temperature = tuple(np.arange(temperature, 1.0 + 1e-6, temperature_increment_on_fallback))\n",
"else:\n",
" temperature = [temperature]\n",
"\n",
"if Model.endswith(\".en\") and args[\"language\"] not in {\"en\", \"English\"}:\n",
" warnings.warn(f\"{Model} is an English-only model but receipted '{args['language']}'; using English instead.\")\n",
" args[\"language\"] = \"en\"\n",
"\n",
"for video_path_local in video_path_local_list:\n",
" display(Markdown(f\"### {video_path_local}\"))\n",
"\n",
" video_transcription = whisper.transcribe(\n",
" whisper_model,\n",
" str(video_path_local),\n",
" temperature=temperature,\n",
" **args,\n",
" )\n",
"\n",
" # Save output\n",
" whisper.utils.get_writer(\n",
" output_format=output_format,\n",
" output_dir=video_path_local.parent\n",
" )(\n",
" video_transcription,\n",
" str(video_path_local.stem),\n",
" options=dict(\n",
" highlight_words=False,\n",
" max_line_count=None,\n",
" max_line_width=None,\n",
" )\n",
" )\n",
"\n",
" def exportTranscriptFile(ext: str):\n",
" local_path = video_path_local.parent / video_path_local.with_suffix(ext).name\n",
" export_path = drive_whisper_path / video_path_local.with_suffix(ext).name\n",
" shutil.copy(\n",
" local_path,\n",
" export_path\n",
" )\n",
" display(Markdown(f\"**Transcript file created: {export_path}**\"))\n",
"\n",
" if output_format==\"all\":\n",
" for ext in ('.txt', '.vtt', '.srt', '.tsv', '.json'):\n",
" exportTranscriptFile(ext)\n",
" else:\n",
" exportTranscriptFile(\".\" + output_format)\n"
]
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "Ad6n1m4deAHp"
},
"execution_count": null,
"outputs": []
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"provenance": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}