-
Notifications
You must be signed in to change notification settings - Fork 56
/
Copy pathNormalization.fsx
356 lines (267 loc) · 12.4 KB
/
Normalization.fsx
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
(**
---
title: Normalization
index: 18
category: Documentation
categoryindex: 0
---
*)
(*** hide ***)
(*** condition: prepare ***)
#I "../src/FSharp.Stats/bin/Release/netstandard2.0/"
#r "FSharp.Stats.dll"
#r "nuget: Plotly.NET, 4.0.0"
Plotly.NET.Defaults.DefaultDisplayOptions <-
Plotly.NET.DisplayOptions.init (PlotlyJSReference = Plotly.NET.PlotlyJSReference.NoReference)
(*** condition: ipynb ***)
#if IPYNB
#r "nuget: Plotly.NET, 4.0.0"
#r "nuget: Plotly.NET.Interactive, 4.0.0"
#r "nuget: FSharp.Stats"
open Plotly.NET
#endif // IPYNB
open Plotly.NET
open Plotly.NET.StyleParam
open Plotly.NET.LayoutObjects
(**
# Normalization
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/fslaborg/FSharp.Stats/gh-pages?urlpath=/tree/home/jovyan/Normalization.ipynb)
[![Notebook]({{root}}img/badge-notebook.svg)]({{root}}{{fsdocs-source-basename}}.ipynb)
_Summary:_ this tutorial demonstrates multiple ways of data normalization accross several samples
### Table of contents
- [Introduction](#Introduction)
- [Median of Ratios](#Median-of-ratios)
- [Quantile normalization](#Quantile-normalization)
## Introduction
When you want to compare e.g. intensity measurements of elements between samples, you often have to normalize the samples in order
to be able to perform a valid comparison. Samples may vary in their average intensity just because of the technical nature of the measurement itself,
thereby shifting or distorting the underlying correct/real distribution. For the presented normalization strategies it is assumed that global changes across the samples are due to unwanted technical variability and only
a small number of elements are dysregulated (Zhao et al, 2020).
## Median of ratios
To correct for batch effects or technical measurment differences the median of ratios normalization (MOR) can be applied.
As explained above, it expects that the majority of genes/proteins/elements do not differ between the conditions that should be compared [Love et al. 2014](https://doi.org/10.1186/s13059-014-0550-8).
At first a reference sample is determined by calculating the geometric mean of each of the **n** elements across all **m** samples. The calculation of the geometric mean as
the n<sup>th</sup> square root of the product of all values (1) seems odd, but when displayed as mean of the log-transformed data (2) it becomes clear, that the geometric
mean is just an outlier-insensitive measure of a robust average, which is intuitive to do when dealing with biological data or in general data with some extreme outliers.
1. <img src="https://render.githubusercontent.com/render/math?math=geomean = \sqrt[n]{\prod_{}^{}x}">
2. <img src="https://render.githubusercontent.com/render/math?math=geomean = e^{\frac{1}{n}\sum_{}^{}log(x)}">
After a reference sample for each row is determined, for each entry a correction factor to reach the reference is calculated. Each sample now has n associated correction factors.
By taking the median of those individual correction factors, an outlier-insensitive measure of the true correction factor is determined. By choosing the median it is assumed that in theory >=50% of all
measured elements should not change between the measured samples.
No prior log-transform has to be applied before normalizing the data with this method. If required a log transform can be applied to the normalized values to restore homoscedasticity.
In the following a step-by-step introduction is given with finally applying `Signal.Normalization.medianOfRatios` function to the given data.
**1. Raw intensities for m=3 samples and n=6 elements:**
|elementID|sample1|sample2|sample3|
|--|--|--|--|
|g00|100|130|30|
|g01|80|200|30|
|g02|0|50|0|
|g03|40|50|20|
|g04|50|45|25|
|g05|40|50|15|
**2. Zero intensities cannot be processed during the MOR-normalization.**
Either imputation, row filtering or data transformation must be performed. In this example a single count is added to each intensity.
Note that this is not necessarily reasonable for every analysis! The reference sample is determined by calculating the geometric mean of every row.
|elementID|sample1|sample2|sample3|reference|
|--|--|--|--|--|
|g00|101|131|31|74.30|
|g01|81|201|31| 79.62|
|g02|1|51|1| 3.71 |
|g03|41|51|21| 35.28|
|g04|51|56|26| 42.03|
|g05|41|51|16| 32.22|
**3. For every element the correction factor is determined by dividing it by its reference.**
|elementID|sample1|sample2|sample3|reference|corr1|corr2|corr3|
|--|--|--|--|--|--|--|--|
|g00|101|131|31|74.30|1.359|1.763|0.417|
|g01|81|201|31| 79.62|1.017|2.525|0.389|
|g02|1|51|1| 3.71 |0.270|13.752|0.270|
|g03|41|51|21| 35.28|1.162|1.446|0.595|
|g04|51|56|26| 42.03|1.213|1.332|0.619|
|g05|41|51|16| 32.22|1.272|1.583|0.497|
It becomes apparent, that sample1 and sample2 are overrepresented and sample3 is underepresented. g02 and g03 are obviously outliers that differ between the samples.
These elements should not influence the correction factors. By determining the median of each sample correction column, a outlier insensitive approximation of the true correction
factor is achieved.
**4. The median of each correction factor column determines the final correction.**
|elementID|sample1|sample2|sample3|reference|corr1|corr2|corr3|
|--|--|--|--|--|--|--|--|
|g00|101|131|31|74.30|1.359|1.763|0.417|
|g01|81|201|31| 79.62|1.017|2.525|0.389|
|g02|1|51|1| 3.71 |0.270|13.752|0.270|
|g03|41|51|21| 35.28|1.162|1.446|0.595|
|g04|51|56|26| 42.03|1.213|1.332|0.619|
|g05|41|51|16| 32.22|1.272|1.583|0.497|
|||||||||
| | | | | **Median:** |**1.188**|**1.673**|**0.457**|
**5. Apply the correction factors to the original (untransformed) data.**
|elementID|sample1|sample2|sample3|reference|corr1|corr2|corr3||normedSample1|normedSample2|normedSample3|
|--|--|--|--|--|--|--|--|--|--|--|--|
|g00|101|131|31|74.30|1.359|1.763|0.417|| 84.19|77.71|65.66 |
|g01|81|201|31| 79.62|1.017|2.525|0.389|| 67.35|119.5|565.66|
|g02|1|51|1| 3.71 |0.270|13.752|0.270|| 0.00|29.89|0.00 |
|g03|41|51|21| 35.28|1.162|1.446|0.595|| 33.68|29.89|43.77 |
|g04|51|56|26| 42.03|1.213|1.332|0.619|| 42.10|32.88|54.72 |
|g05|41|51|16| 32.22|1.272|1.583|0.497|| 33.68|29.89|32.83 |
|||||||||||||
| | | | | **Median:** |**1.188**|**1.673**|**0.457**|||||
The normed intensities of the different elements now match a theoretical common intensity level.
*)
open FSharp.Stats
let rawData =
[|
[|100.; 130.; 30.|]
[| 80.; 200.; 30.|]
[| 0.; 50.; 0.|]
[| 40.; 50.; 20.|]
[| 50.; 45.; 25.|]
[| 40.; 50.; 15.|]
|]
|> matrix
// visualization of the raw data
let rawDataChart =
rawData.Transpose
|> Matrix.toJaggedArray
|> Array.mapi (fun sampleID sample ->
let sampleIntensities =
sample
|> Array.mapi (fun gID intensity ->
gID,intensity
)
Chart.Column(sampleIntensities,Name = sprintf "raw sample%i" (sampleID+1))
)
|> Chart.combine
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "gID"
|> Chart.withYAxisStyle "raw intensity"
|> Chart.withTitle "raw data"
(*** condition: ipynb ***)
#if IPYNB
rawDataChart
#endif // IPYNB
(***hide***)
rawDataChart |> GenericChart.toChartHTML
(***include-it-raw***)
(**
In the above figure you can see, that the green sample is underrepresented, and the orange is overrepresented when compared to the blue one.
No the MOR normalization is applied. If no transformation (+1) should be applied, you can use `Signal.Normalization.medianOfRatios`.
*)
/// actual median of ratios normalization. 1 is added to each intensity to get rid of the empty intensities.
let mor =
Signal.Normalization.medianOfRatiosBy (fun x -> x + 1.0) rawData
/// Matrix of normalized data in the same order as the input matrix
let morNormedData =
mor.NormedData
/// Correction factors to assess the strenght of the applied normalization (1 indicates no normalization)
let corrFactors =
let cf1 = Seq.item 0 mor.CorrFactors
let cf2 = Seq.item 1 mor.CorrFactors
let cf3 = Seq.item 2 mor.CorrFactors
sprintf "Correctionfactor for sample1 is %.3f\nCorrectionfactor for sample2 is %.3f\nCorrectionfactor for sample3 is %.3f" cf1 cf2 cf3
(*** condition: ipynb ***)
#if IPYNB
corrFactors
#endif // IPYNB
(***hide***)
corrFactors
(***include-it-raw***)
// visualization of the normed data
let normedDataChart =
morNormedData.Transpose
|> Matrix.toJaggedArray
|> Array.mapi (fun sampleID sample ->
let sampleIntensities =
sample
|> Array.mapi (fun gID intensity ->
gID,intensity
)
Chart.Column(sampleIntensities,Name = sprintf "MOR normed sample%i" (sampleID+1))
)
|> Chart.combine
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "gID"
|> Chart.withYAxisStyle "MOR intensity"
(*** condition: ipynb ***)
#if IPYNB
normedDataChart
#endif // IPYNB
(***hide***)
normedDataChart |> GenericChart.toChartHTML
(***include-it-raw***)
(**
The normed data now shows high similarity of the individual element intensities across samples.
# Quantile normalization
To compensate for the technical variance you also can perform a quantile normalization. It is a technique for making two or more
distributions identical in statistical properties and was originally developed for gene expression microarrays. It sees widespread use, constituting a standard part
of analysis pipelines for high-throughput analysis.
You can either quantile normalize data according to a given reference distribution (e.g. Gamma or Normal distribution) or create your own reference distribution out of your samples.
The same dataset as above is used and a quantile normalization is performed. Since no log transform is applied, there is no necessity to add a constant to each intensity:
**1. Raw intensities for m=3 samples and n=6 elements:**
|elementID|sample1|sample2|sample3|
|--|--|--|--|
|g00|100|130|30|
|g01|80|200|30|
|g02|0|50|0|
|g03|40|50|20|
|g04|50|45|25|
|g05|40|50|15|
**2. Intensities are indexed and sorted in ascending order. The row average intensity is determined.**
|sample1|sample2|sample3| |row average|
|--|--|--|--|--|
|3 -> 0|5 -> 45|3 -> 0 | |15|
|4 -> 40|3 -> 50|6 -> 15 | |35|
|6 -> 40|4 -> 50|4 -> 20 | |36.7|
|5 -> 50|6 -> 50|5 -> 25 | |41.7|
|2 -> 80|1 -> 130|1 -> 30 | |80|
|1 -> 100|2 -> 200|2 -> 30| |110|
**3. Every intensity is replaced by the row average intensity**
|sample1|sample2|sample3| |row average|
|--|--|--|--|--|
|3;15 |5;15 |3;15 ||15|
|4;35 |3;35 |6;35 ||35|
|6;36.7 |4;36.7 |4;36.7 ||36.7|
|5;41.7 |6;41.7 |5;41.7 ||41.7|
|2;80 |1;80 |1;80 ||80|
|1;110 |2;110 |2;110 ||110|
**4. Finally the columns are resorted to their original order using the indices. The indices are removed**
|elementID|normedSample1|normedSample2|normedSample3|
|--|--|--|--|
|g00|110 |80 |80 |
|g01|80 |110 |110 |
|g02|15 |35 |15 |
|g03|35 |36.7 |36.7 |
|g04|41.7 |15 |41.7 |
|g05|36.7 |41.7 |35 |
*)
let quantileNorm =
Signal.Normalization.quantile rawData
// visualization of the normed data
let normedDataQuantileChart =
quantileNorm.Transpose
|> Matrix.toJaggedArray
|> Array.mapi (fun sampleID sample ->
let sampleIntensities =
sample
|> Array.mapi (fun gID intensity ->
gID,intensity
)
Chart.Column(sampleIntensities,Name = sprintf "quantile normed sample%i" (sampleID+1))
)
|> Chart.combine
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "gID"
|> Chart.withYAxisStyle "qNorm intensity"
(*** condition: ipynb ***)
#if IPYNB
normedDataQuantileChart
#endif // IPYNB
(***hide***)
normedDataQuantileChart |> GenericChart.toChartHTML
(***include-it-raw***)
(**
As seen in the normalized column chart, the final samples all consists of the same values, just assigned to different row indices according to the rank within the sample.
For data with low row count, this normalization is not appropriate because there is a severe disturbance of the intensity values. The more data is incorporated (more rows), the lower
the single value influence will be.
## Notes
You can assess the quality of your normalization by performing a PCA prior and after normalization and compare different normalization strategies. While the
correction factors for MOR can be investigated, quantile normalization is a non-linear transformation since every value is normalized by its own rank-row mean.
*)