Skip to content

FEAT: Support Emojis for data types #166

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Conversation

gargsaumya
Copy link
Contributor

@gargsaumya gargsaumya commented Aug 5, 2025

Work Item / Issue Reference

AB#37664

GitHub Issue: #<ISSUE_NUMBER>


Summary

This pull request improves Unicode string handling in the SQL type mapping logic and adds comprehensive tests for round-tripping emoji and special characters in the database. The main focus is on ensuring that Unicode strings, including those with emojis and special characters, are correctly mapped and stored in SQL Server.

Unicode string mapping improvements:

  • Updated the _map_sql_type method in cursor.py to use the UTF-16 code unit length for Unicode string parameters instead of the Python string length, ensuring correct sizing for both long and short Unicode strings. [1] [2]

Testing enhancements:

  • Added a new test test_emoji_round_trip in test_004_cursor.py to verify that various emoji, accented, and non-Latin characters can be inserted and retrieved accurately from the database, improving coverage for Unicode edge cases.

@Copilot Copilot AI review requested due to automatic review settings August 5, 2025 07:05
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request enhances Unicode string handling in the SQL type mapping logic to properly support emojis and special characters. The key improvement addresses the incorrect sizing calculation for Unicode strings by using UTF-16 code unit length instead of Python string length.

  • Updated _map_sql_type method to calculate UTF-16 code unit length for Unicode parameters
  • Added comprehensive test coverage for emoji and special character round-trip operations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
mssql_python/cursor.py Fixed Unicode string length calculation by using UTF-16 encoding length instead of Python string length
tests/test_004_cursor.py Added comprehensive test for emoji and special character round-trip database operations

@github-actions github-actions bot added the pr-size: small Minimal code update label Aug 6, 2025
@gargsaumya gargsaumya force-pushed the saumya/emoji-support branch from 353dad1 to 6ef29c1 Compare August 6, 2025 10:18
@github-actions github-actions bot added pr-size: small Minimal code update and removed pr-size: small Minimal code update labels Aug 6, 2025
@github-actions github-actions bot added pr-size: small Minimal code update and removed pr-size: small Minimal code update labels Aug 6, 2025
@github-actions github-actions bot added pr-size: small Minimal code update and removed pr-size: small Minimal code update labels Aug 6, 2025
@github-actions github-actions bot added pr-size: small Minimal code update and removed pr-size: small Minimal code update labels Aug 6, 2025
@github-actions github-actions bot added pr-size: small Minimal code update and removed pr-size: small Minimal code update labels Aug 6, 2025
@github-actions github-actions bot added pr-size: small Minimal code update and removed pr-size: small Minimal code update labels Aug 6, 2025
bewithgaurav
bewithgaurav previously approved these changes Aug 6, 2025
@github-actions github-actions bot added pr-size: small Minimal code update and removed pr-size: small Minimal code update labels Aug 6, 2025
@github-actions github-actions bot added pr-size: small Minimal code update and removed pr-size: small Minimal code update labels Aug 6, 2025
@github-actions github-actions bot added pr-size: small Minimal code update and removed pr-size: small Minimal code update labels Aug 6, 2025
@@ -323,10 +323,11 @@ def _map_sql_type(self, param, parameters_list, i):
# TODO: revisit
if len(param) > 4000: # Long strings
if is_unicode:
utf16_len = len(param.encode("utf-16-le")) // 2
Copy link
Contributor

@sumitmsft sumitmsft Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we wrap utf16_len = len(param.encode("utf-16-le")) // 2 in a function and then check on the following: Use try-catch for encoding failures (to handle Unicode Encode error)?

Something like:

def _calculate_utf16_length(self, param):
   ...
   ...
    try:
        return len(param.encode("utf-16-le")) // 2
    except UnicodeEncodeError as e:                    // or some other exception
        # Log error and fallback to Python length
        logger.warning(f"UTF-16 encoding failed: {e}")
        return len(param)

std::vector<SQLWCHAR> result(str.size() + 1, 0); // +1 for null terminator
for (size_t i = 0; i < str.size(); ++i) {
result[i] = static_cast<SQLWCHAR>(str[i]);
std::vector<SQLWCHAR> result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put some comments around Surrogate logic to better understand the code?


for (wchar_t wc : str) {
uint32_t codePoint = static_cast<uint32_t>(wc);
if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make 0xD800, 0xDFFF, 0x10FFFF named constants?

for (wchar_t wc : str) {
uint32_t codePoint = static_cast<uint32_t>(wc);
if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
// Skip invalid lone surrogates (shouldn't occur in well-formed wchar_t strings)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test case for "Hello 🚀 World" and "admin🔒user" string and see if there is any corruption happening? Just wanted to check if this goes through and that we do not drop drop surrogates.

If the surrogates are being dropped, then there could be a chance for SQL injection with these type of queries:

"SELECT * FROM users WHERE id = 1🚀' OR '1'='1"

@@ -275,14 +275,12 @@ SQLRETURN BindParameters(SQLHANDLE hStmt, const py::list& params,

// Reserve space and convert from wstring to SQLWCHAR array
sqlwcharBuffer->resize(strParam->size() + 1, 0); // +1 for null terminator
std::vector<SQLWCHAR> utf16 = WStringToSQLWCHAR(*strParam);
sqlwcharBuffer->assign(utf16.begin(), utf16.end());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WStringToSQLWCHAR() can return vectors of unpredictable size because UTF-16 conversion may expand data significantly and surrogate pairs double the storage requirements for emoji (as emojis needs 2 code units in UTF-16).

But, vector::assign() doesn't validate capacity and will attempt to store all elements regardless of buffer size.

If utf16.size() > sqlwcharBuffer then we will hit buffer overflow.

There is no error handling or logging for size mismatches

@@ -1528,7 +1526,10 @@ SQLRETURN SQLGetData_wrap(SqlHandlePtr StatementHandle, SQLUSMALLINT colCount, p
if (numCharsInData < dataBuffer.size()) {
// SQLGetData will null-terminate the data
#if defined(__APPLE__) || defined(__linux__)
row.append(SQLWCHARToWString(dataBuffer.data(), SQL_NTS));
auto raw_bytes = reinterpret_cast<const char*>(dataBuffer.data());
py::bytes py_bytes(raw_bytes, dataLen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, py::bytes py_bytes(raw_bytes, dataLen) does not validate that dataLen fits within the allocated buffer. If the dataLen is larger than the real buffer, this can leak memory beyond the intended area—exposing potentially sensitive data (like passwords, other users’ data, etc.) to Python.

Let's check if dataLen is non-negative and does not exceed the actual buffer size before creating the Python bytes object.

} else if (codePoint <= 0x10FFFF) {
// Encode as surrogate pair
codePoint -= 0x10000;
SQLWCHAR highSurrogate = static_cast<SQLWCHAR>((codePoint >> 10) + 0xD800);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function assumes that each wchar_t wc from the input contains a valid Unicode Codepoint, but there's no validation that the wchar_t values are actually valid Unicode.

Since, wchar_t could be different on Win (UTF 16) and Linux (UTF32) it might cause issues.

wstring may contain invalid or malformed codepoint and may cause issues later on.

Copy link
Contributor

@sumitmsft sumitmsft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, please see if we can fix them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-size: small Minimal code update
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants