-
Notifications
You must be signed in to change notification settings - Fork 16
FEAT: Support Emojis for data types #166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request enhances Unicode string handling in the SQL type mapping logic to properly support emojis and special characters. The key improvement addresses the incorrect sizing calculation for Unicode strings by using UTF-16 code unit length instead of Python string length.
- Updated
_map_sql_type
method to calculate UTF-16 code unit length for Unicode parameters - Added comprehensive test coverage for emoji and special character round-trip operations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
mssql_python/cursor.py | Fixed Unicode string length calculation by using UTF-16 encoding length instead of Python string length |
tests/test_004_cursor.py | Added comprehensive test for emoji and special character round-trip database operations |
353dad1
to
6ef29c1
Compare
@@ -323,10 +323,11 @@ def _map_sql_type(self, param, parameters_list, i): | |||
# TODO: revisit | |||
if len(param) > 4000: # Long strings | |||
if is_unicode: | |||
utf16_len = len(param.encode("utf-16-le")) // 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we wrap utf16_len = len(param.encode("utf-16-le")) // 2
in a function and then check on the following: Use try-catch for encoding failures (to handle Unicode Encode error)?
Something like:
def _calculate_utf16_length(self, param):
...
...
try:
return len(param.encode("utf-16-le")) // 2
except UnicodeEncodeError as e: // or some other exception
# Log error and fallback to Python length
logger.warning(f"UTF-16 encoding failed: {e}")
return len(param)
std::vector<SQLWCHAR> result(str.size() + 1, 0); // +1 for null terminator | ||
for (size_t i = 0; i < str.size(); ++i) { | ||
result[i] = static_cast<SQLWCHAR>(str[i]); | ||
std::vector<SQLWCHAR> result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put some comments around Surrogate logic to better understand the code?
|
||
for (wchar_t wc : str) { | ||
uint32_t codePoint = static_cast<uint32_t>(wc); | ||
if (codePoint >= 0xD800 && codePoint <= 0xDFFF) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make 0xD800, 0xDFFF, 0x10FFFF named constants?
for (wchar_t wc : str) { | ||
uint32_t codePoint = static_cast<uint32_t>(wc); | ||
if (codePoint >= 0xD800 && codePoint <= 0xDFFF) { | ||
// Skip invalid lone surrogates (shouldn't occur in well-formed wchar_t strings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case for "Hello 🚀 World" and "admin🔒user" string and see if there is any corruption happening? Just wanted to check if this goes through and that we do not drop drop surrogates.
If the surrogates are being dropped, then there could be a chance for SQL injection with these type of queries:
"SELECT * FROM users WHERE id = 1🚀' OR '1'='1"
@@ -275,14 +275,12 @@ SQLRETURN BindParameters(SQLHANDLE hStmt, const py::list& params, | |||
|
|||
// Reserve space and convert from wstring to SQLWCHAR array | |||
sqlwcharBuffer->resize(strParam->size() + 1, 0); // +1 for null terminator | |||
std::vector<SQLWCHAR> utf16 = WStringToSQLWCHAR(*strParam); | |||
sqlwcharBuffer->assign(utf16.begin(), utf16.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WStringToSQLWCHAR() can return vectors of unpredictable size because UTF-16 conversion may expand data significantly and surrogate pairs double the storage requirements for emoji (as emojis needs 2 code units in UTF-16).
But, vector::assign()
doesn't validate capacity and will attempt to store all elements regardless of buffer size.
If utf16.size() > sqlwcharBuffer then we will hit buffer overflow.
There is no error handling or logging for size mismatches
@@ -1528,7 +1526,10 @@ SQLRETURN SQLGetData_wrap(SqlHandlePtr StatementHandle, SQLUSMALLINT colCount, p | |||
if (numCharsInData < dataBuffer.size()) { | |||
// SQLGetData will null-terminate the data | |||
#if defined(__APPLE__) || defined(__linux__) | |||
row.append(SQLWCHARToWString(dataBuffer.data(), SQL_NTS)); | |||
auto raw_bytes = reinterpret_cast<const char*>(dataBuffer.data()); | |||
py::bytes py_bytes(raw_bytes, dataLen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, py::bytes py_bytes(raw_bytes, dataLen)
does not validate that dataLen fits within the allocated buffer. If the dataLen is larger than the real buffer, this can leak memory beyond the intended area—exposing potentially sensitive data (like passwords, other users’ data, etc.) to Python.
Let's check if dataLen is non-negative and does not exceed the actual buffer size before creating the Python bytes object.
} else if (codePoint <= 0x10FFFF) { | ||
// Encode as surrogate pair | ||
codePoint -= 0x10000; | ||
SQLWCHAR highSurrogate = static_cast<SQLWCHAR>((codePoint >> 10) + 0xD800); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function assumes that each wchar_t wc
from the input contains a valid Unicode Codepoint, but there's no validation that the wchar_t values are actually valid Unicode.
Since, wchar_t could be different on Win (UTF 16) and Linux (UTF32) it might cause issues.
wstring may contain invalid or malformed codepoint and may cause issues later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, please see if we can fix them.
Work Item / Issue Reference
Summary
This pull request improves Unicode string handling in the SQL type mapping logic and adds comprehensive tests for round-tripping emoji and special characters in the database. The main focus is on ensuring that Unicode strings, including those with emojis and special characters, are correctly mapped and stored in SQL Server.
Unicode string mapping improvements:
_map_sql_type
method incursor.py
to use the UTF-16 code unit length for Unicode string parameters instead of the Python string length, ensuring correct sizing for both long and short Unicode strings. [1] [2]Testing enhancements:
test_emoji_round_trip
intest_004_cursor.py
to verify that various emoji, accented, and non-Latin characters can be inserted and retrieved accurately from the database, improving coverage for Unicode edge cases.