Data Processing

Data Smoothing

Savitzky-Golay Filter

A Savitzky–Golay filter is a digital filter that can be applied to a set of data points to smooth them, that is, to reduce noise without distorting the underlying trend of the signal. This is achieved, in a process known as convolution, by fitting successive subsets of adjacent data points with a low-degree polynomial using linear least squares. When the data points are equally spaced, the least-squares equations have an analytical solution in the form of a single set of "convolution coefficients" that can be applied to every subset to estimate the smoothed signal (or its derivatives) at the central point of that subset.

The filter slides a window over the dataset. At each step, a local polynomial is fitted to the data points inside the window, and the polynomial is evaluated at the window's central point to produce the new, smoothed data point. The window then moves to the next part of the dataset, and the process repeats.
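
As a minimal sketch of the "convolution coefficients" idea, the snippet below applies the well-known coefficients for a 5-point window and a degree-2 polynomial, (-3, 12, 17, 12, -3) / 35, by hand. The function name `savgol_5_2` and the sample data are made up for illustration, and the edge points are simply left unchanged, which is a simplification and not what `scipy.signal.savgol_filter` does by default.

```python
# Convolution coefficients for a 5-point window and a degree-2 polynomial.
SG_5_2 = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]


def savgol_5_2(values):
    """Smooth `values` with the 5-point quadratic Savitzky-Golay coefficients.

    Assumption for this sketch: the first and last two points are copied
    unchanged because a full window does not fit around them.
    """
    half = len(SG_5_2) // 2
    smoothed = list(values)
    for center in range(half, len(values) - half):
        window = values[center - half:center + half + 1]
        # Weighted sum of the window = value of the fitted polynomial at the center.
        smoothed[center] = sum(c * v for c, v in zip(SG_5_2, window))
    return smoothed


# Hypothetical noisy samples of a roughly quadratic signal.
noisy = [0.0, 1.2, 3.9, 9.3, 15.8, 25.1, 36.2, 48.7]
print(savgol_5_2(noisy))
```

Because each window is fitted with a degree-2 polynomial, data that already lies exactly on a parabola passes through this filter unchanged at the interior points.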

from scipy.signal import savgol_filter

# Signature: savgol_filter(x, window_length, polyorder)

# Smooth the 'prediction' column with a 5-point window and a degree-2 polynomial
df_time_series['savgol'] = df_time_series['prediction'].transform(lambda x: savgol_filter(x, 5, 2))
  • The window_length parameter specifies how many data points are used for each local polynomial fit. The polyorder parameter specifies the degree of the fitted polynomial (with a degree of 1, each fit is a linear regression).
  • The larger the window, the more of the signal each fit averages over: smoothing becomes stronger, but the fitted curve follows the data less closely and narrow features such as sharp peaks are flattened.
  • For the Savitzky-Golay filter to work properly, choose an odd window size, so that the window has a central point, and a polynomial order lower than the window size.

Whittaker Smoother

The Whittaker smoother fits a curve that represents the raw data but is penalized if successive points vary too much. It balances two goals: keeping the residual to the original data small and keeping the fitted curve smooth.
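
The trade-off described above can be sketched in a few lines of plain Python. This is an illustrative version only: it assumes a first-order difference penalty (the smoother is often used with second-order differences), and the function name `whittaker_smooth`, the parameter `lam`, and the sample data are made up for this example. It minimizes sum((y[i] - z[i])**2) + lam * sum((z[i+1] - z[i])**2), which leads to a tridiagonal linear system.

```python
def whittaker_smooth(y, lam=10.0):
    """Whittaker-style smoother with a first-order difference penalty.

    Solves (I + lam * D'D) z = y, where D takes first differences.
    A larger `lam` favours smoothness; lam = 0 returns the data unchanged.
    """
    n = len(y)
    if n < 2:
        return list(y)

    # I + lam * D'D is tridiagonal: the diagonal is 1 + lam * (1, 2, ..., 2, 1)
    # and both off-diagonals are -lam.
    diag = [1.0 + lam * (1 if i in (0, n - 1) else 2) for i in range(n)]
    off = -lam

    # Thomas algorithm: forward elimination ...
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag[0]
    d[0] = y[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (y[i] - off * d[i - 1]) / denom

    # ... followed by back substitution.
    z = [0.0] * n
    z[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        z[i] = d[i] - c[i] * z[i + 1]
    return z


# Hypothetical noisy samples.
raw = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
print(whittaker_smooth(raw, lam=5.0))
```

Increasing `lam` pulls neighbouring points closer together at the cost of a larger residual to the raw data.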

Batch vs Stream

| Batch | Stream |
| --- | --- |
| Larger datasets | Simpler analysis: aggregation / filtering |
| More complex analysis | Individual records / micro batches |
| Slower moving data (hours, days) | Data moves FAST |
| With batch processing, a batch of information is collected before being sent in for processing | With streaming, data is sent for analysis piece-by-piece, and processed in real time |
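
To make the contrast concrete, here is a small, hypothetical sketch in plain Python: the batch function needs the whole collection before it can produce a result, while the streaming function updates its result as each record arrives. The function names and the sample readings are made up for illustration and do not refer to any particular processing framework.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch: the full dataset is collected first, then processed in one go."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream: each record is processed on arrival, yielding a running average."""
    total = 0.0
    count = 0
    for value in readings:
        total += value
        count += 1
        yield total / count


readings = [1.0, 2.0, 3.0, 4.0]
print(batch_average(readings))            # one result after all data is in
print(list(streaming_average(readings)))  # one result per incoming record
```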

Stream

Roles vs Users

| Roles | Users |
| --- | --- |
| Live in IAM | Live in IAM |
| Have permissions policies | Have permissions policies |
| Only attach to other services | Can act on their own |
| Have no keys or login | Can have keys or login |

Convert Pandas DataFrame to bytes-like object

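import io

# df is assumed to be an existing pandas DataFrame
towrite = io.BytesIO()
df.to_excel(towrite)  # write the Excel file into the in-memory buffer
towrite.seek(0)  # rewind so that later reads start at the beginning

print(towrite)
> <_io.BytesIO object at 0x...>
print(type(towrite))
> <class '_io.BytesIO'>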

To see the bytes-like object itself, use getvalue():

print(towrite.getvalue())
> b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x00!\x00<\xb
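
The role of `seek(0)` and `getvalue()` can be seen without pandas. The following standard-library sketch uses a made-up payload (`b"hello"`) in place of the Excel bytes: after a write, the buffer's cursor sits at the end, so `read()` returns nothing until the cursor is rewound, whereas `getvalue()` returns the full contents regardless of the cursor position.

```python
import io

buffer = io.BytesIO()
buffer.write(b"hello")       # the cursor is now at the end of the buffer

at_end = buffer.read()       # nothing left to read after the cursor
buffer.seek(0)               # rewind to the start
from_start = buffer.read()   # reads everything from the start
whole = buffer.getvalue()    # full contents, independent of the cursor

print(at_end, from_start, whole)
```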