feat: Qwen2.5-omni-7b full modal speech recognition #3870

Merged
@@ -15,7 +15,7 @@
     AliyunBaiLianEmbeddingCredential
 from models_provider.impl.aliyun_bai_lian_model_provider.credential.image import QwenVLModelCredential
 from models_provider.impl.aliyun_bai_lian_model_provider.credential.llm import BaiLianLLMModelCredential
-from models_provider.impl.aliyun_bai_lian_model_provider.credential.omi_stt import AliyunBaiLianOmiSTTModelCredential
+from models_provider.impl.aliyun_bai_lian_model_provider.credential.omni_stt import AliyunBaiLianOmiSTTModelCredential
 from models_provider.impl.aliyun_bai_lian_model_provider.credential.reranker import \
     AliyunBaiLianRerankerCredential
 from models_provider.impl.aliyun_bai_lian_model_provider.credential.stt import AliyunBaiLianSTTModelCredential
@@ -24,7 +24,7 @@
 from models_provider.impl.aliyun_bai_lian_model_provider.model.embedding import AliyunBaiLianEmbedding
 from models_provider.impl.aliyun_bai_lian_model_provider.model.image import QwenVLChatModel
 from models_provider.impl.aliyun_bai_lian_model_provider.model.llm import BaiLianChatModel
-from models_provider.impl.aliyun_bai_lian_model_provider.model.omi_stt import AliyunBaiLianOmiSpeechToText
+from models_provider.impl.aliyun_bai_lian_model_provider.model.omni_stt import AliyunBaiLianOmiSpeechToText
 from models_provider.impl.aliyun_bai_lian_model_provider.model.reranker import AliyunBaiLianReranker
 from models_provider.impl.aliyun_bai_lian_model_provider.model.stt import AliyunBaiLianSpeechToText
 from models_provider.impl.aliyun_bai_lian_model_provider.model.tti import QwenTextToImageModel
@@ -80,6 +80,9 @@
     ModelInfo('qwen-omni-turbo',
               _('The Qwen Omni series model supports inputting multiple modalities of data, including video, audio, images, and text, and outputting audio and text.'),
               ModelTypeConst.STT, aliyun_bai_lian_omi_stt_model_credential, AliyunBaiLianOmiSpeechToText),
+    ModelInfo('qwen2.5-omni-7b',
+              _('The Qwen Omni series model supports inputting multiple modalities of data, including video, audio, images, and text, and outputting audio and text.'),
+              ModelTypeConst.STT, aliyun_bai_lian_omi_stt_model_credential, AliyunBaiLianOmiSpeechToText),
 ]

 module_info_vl_list = [
Contributor Author:

Review and Recommendations

  1. Code Fragment: The diff touches several modules:

    • The credential import changed from models_provider.impl.aliyun_bai_lian_model_provider.credential.omi_stt to the corrected credential.omni_stt module.

    • The model import changed from models_provider.impl.aliyun_bai_lian_model_provider.model.omi_stt to the corrected model.omni_stt module.

    • A new model entry was added:

      ModelInfo('qwen2.5-omni-7b', ...
  2. Import Statements:

    • The renamed omni_stt modules still export the same classes (AliyunBaiLianOmiSTTModelCredential and AliyunBaiLianOmiSpeechToText), so only the module paths change.
    • All related registrations in the provider were updated accordingly.
  3. Comments:

    • No significant comment modifications noted.

Overall Conclusion

  • The provided changes appear consistent with the overall structure and functionality of the API.

No major technical issues were identified in this snippet. However, consider adding comments above each entry (like ModelInfo) explaining their purpose, given that they might not always be immediately self-explanatory without context.
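
As a rough illustration (not part of the PR), a commented registry entry mirroring the one added in this diff might look like the following; the comment text itself is only an example:

# Qwen2.5-Omni 7B registered as a speech-to-text model: audio is sent to the
# configured OpenAI-compatible endpoint and only the text transcript is requested.
ModelInfo('qwen2.5-omni-7b',
          _('The Qwen Omni series model supports inputting multiple modalities of data, including video, audio, images, and text, and outputting audio and text.'),
          ModelTypeConst.STT, aliyun_bai_lian_omi_stt_model_credential, AliyunBaiLianOmiSpeechToText),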

Additional Advice for Quality Assurance:

For robustness before release, consider these enhancements:

  1. Validation Checks: Add checks that ensure credential and model configurations match the expected types before they are used (a sketch follows this list).
  2. Edge Case Testing: Cover scenarios where user inputs may differ slightly to identify potential bugs early.
  3. Testing Frameworks: Integrate existing testing frameworks to automate repeated unit, integration, and end-to-end tests after making these updates.
  4. Performance Monitoring: After deployment, set up performance monitoring tools to catch regressions quickly when updating similar APIs.
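
As a sketch of the first point (the helper name validate_omni_stt_credential and the required-field list are assumptions for illustration, not existing project code):

from typing import Dict

REQUIRED_OMNI_STT_FIELDS = ('api_key', 'api_url')

def validate_omni_stt_credential(model_credential: Dict[str, object]) -> None:
    """Fail fast if a required credential field is missing or is not a non-empty string."""
    for field in REQUIRED_OMNI_STT_FIELDS:
        value = model_credential.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"Missing or invalid credential field: {field!r}")

Calling such a check at the top of new_instance would surface a misconfigured api_url or api_key before the OpenAI client is ever constructed.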

Feel free to add more detailed feedback or ask about specific areas if needed!

@@ -17,7 +17,8 @@ class AliyunBaiLianOmiSTTModelParams(BaseForm):


 class AliyunBaiLianOmiSTTModelCredential(BaseForm, BaseModelCredential):
-    api_key = PasswordInputField("API key", required=True)
+    api_url = forms.TextInputField(_('API URL'), required=True)
+    api_key = forms.PasswordInputField(_('API Key'), required=True)

     def is_valid(self,
                  model_type: str,
@@ -12,6 +12,7 @@

 class AliyunBaiLianOmiSpeechToText(MaxKBBaseModel, BaseSpeechToText):
     api_key: str
+    api_url: str
     model: str
     params: dict

@@ -20,6 +21,7 @@ def __init__(self, **kwargs):
         self.api_key = kwargs.get('api_key')
         self.model = kwargs.get('model')
         self.params = kwargs.get('params')
+        self.api_url = kwargs.get('api_url')

     @staticmethod
     def is_cache_model():
@@ -30,6 +32,7 @@ def new_instance(model_type, model_name, model_credential: Dict[str, object], **
         return AliyunBaiLianOmiSpeechToText(
             model=model_name,
             api_key=model_credential.get('api_key'),
+            api_url=model_credential.get('api_url'),
             params=model_kwargs,
             **model_kwargs
         )
@@ -47,13 +50,13 @@ def speech_to_text(self, audio_file):
             client = OpenAI(
                 # If the environment variable is not configured, replace the line below with your Aliyun Bailian API key: api_key="sk-xxx",
                 api_key=self.api_key,
-                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
+                base_url=self.api_url,
             )

             base64_audio = base64.b64encode(audio_file.read()).decode("utf-8")

             completion = client.chat.completions.create(
-                model="qwen-omni-turbo-0119",
+                model=self.model,
                 messages=[
                     {
                         "role": "user",
@@ -71,16 +74,15 @@
                 ],
                 # Set the output modalities; two options are currently supported: ["text","audio"] and ["text"]
                 modalities=["text"],
-                audio={"voice": "Cherry", "format": "mp3"},
                 # stream must be set to True, otherwise an error is raised
                 stream=True,
                 stream_options={"include_usage": True},
             )
             result = []
             for chunk in completion:
-                if chunk.choices and hasattr(chunk.choices[0].delta, 'audio'):
-                    transcript = chunk.choices[0].delta.audio.get('transcript')
-                    result.append(transcript)
+                if chunk.choices and hasattr(chunk.choices[0].delta, 'content'):
+                    content = chunk.choices[0].delta.content
+                    result.append(content)
             return "".join(result)

         except Exception as err:
@@ -30,8 +30,6 @@ def new_instance(model_type, model_name, model_credential: Dict[str, object], **
             optional_params['max_tokens'] = model_kwargs['max_tokens']
         if 'temperature' in model_kwargs and model_kwargs['temperature'] is not None:
             optional_params['temperature'] = model_kwargs['temperature']
-        if model_name == 'qwen-omni-turbo':
-            optional_params['streaming'] = True
         return AliyunBaiLianSpeechToText(
             model=model_name,
             api_key=model_credential.get('api_key'),
Contributor Author:

The provided code snippet has some issues:

  1. Optional Parameters Handling: optional_params is filled key by key from model_kwargs; if more than one source supplies 'max_tokens' or 'temperature', the last assignment silently overwrites the earlier value.

  2. Qwen Model Streaming Support: Streaming is force-enabled whenever model_name == 'qwen-omni-turbo'. If a caller has already set 'streaming' in model_kwargs (or another source overrides it), this line still switches streaming on, so the intended precedence should be made explicit.

  3. Dictionary Usage: Dict[str, object] means every value comes back typed as object and needs casting before use; Dict[str, Any] is the more conventional annotation for this kind of heterogeneous credential dictionary and works more smoothly with type checkers.

  4. Comments Clarity: The comments describe what each part of the code does; however, ensuring they are accurate with respect to actual behavior could make the function easier to reason about.

Here are the suggested improvements:

from typing import Any, Dict

def new_instance(
    model_type: str,
    model_name: str,
    model_credential: Dict[str, Any],
    **model_kwargs
) -> AliyunBaiLianSpeechToText:
    optional_params = {
        "max_tokens": 1000,  # Default value for max tokens, adjust as needed based on service documentation
        "temperature": 0.7   # Default temperature for generation quality, adjust as needed
    }
    
    # Update optional parameters from kwargs if present and not None
    if 'max_tokens' in model_kwargs and model_kwargs['max_tokens'] is not None:
        optional_params['max_tokens'] = model_kwargs['max_tokens']
    if 'temperature' in model_kwargs and model_kwargs['temperature'] is not None:
        optional_params['temperature'] = model_kwargs['temperature']
    
    # Enable streaming explicitly for Qwen-omni-turbo if necessary
    if model_name.lower() == 'qwen-omni-turbo':
        optional_params['streaming'] = True
    
    return AliyunBaiLianSpeechToText(
        model=model_name,
        api_key=model_credential.get('api_key'),
        **optional_params
    )

Key Changes:

  • Used Dict[str, Any] for model_credential so heterogeneous key-value pairs can be passed without casts.
  • Added explicit defaults for optional_params (max_tokens, temperature) so the fallback behavior is visible in one place.
  • Explicitly enabled streaming only when the model name matches 'qwen-omni-turbo', assuming lowercase comparisons align with the expected input format.
  • Improved commenting to reflect the current behavior better.
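
For context, a hypothetical call site for the factory sketched above (the model name, credential values, and keyword arguments are placeholders, not values taken from this PR):

# Hypothetical usage of the new_instance sketch; adjust names and values to the real service.
credential = {"api_key": "sk-xxx"}
stt_model = new_instance(
    model_type="STT",
    model_name="qwen-omni-turbo",
    model_credential=credential,
    max_tokens=512,
    temperature=0.2,
)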
