This repository has been archived by the owner on Aug 21, 2022. It is now read-only.
floatPrecision.py
#!/usr/bin/env python3
"""
Demonstrates the limits of IEEE 754 floating-point precision.

See the Wikipedia articles "Unit in the last place" and "Machine epsilon":

"Machine epsilon is defined as the smallest number that, when added to one, yields a result different from one."

Compare with:

Matlab/Octave
-------------
>> eps(double(1)) => 2.2204e-16
>> eps(single(1)) => 1.1921e-07

https://blogs.mathworks.com/cleve/2017/05/22/quadruple-precision-128-bit-floating-point-arithmetic/
Matlab and Octave can support quad precision using external libraries, at the cost of much slower computation.
I would consider Fortran for a project needing quad precision.

Fortran
-------
bits of precision for 32-bit float    24
machine epsilon: 32-bit float         1.19209290E-07
bits of precision for 64-bit float    53
machine epsilon: 64-bit float         2.2204460492503131E-016
bits of precision for 128-bit float   113
machine epsilon: 128-bit float        1.92592994438723585305597794258492732E-0034

Half precision
--------------
GCC: currently only on ARM
https://gcc.gnu.org/onlinedocs/gcc/Half-Precision.html
Matlab: via external libraries.
Python: Numpy
"""
import numpy as np
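# %% half prec
# Double h until adding 1 no longer changes it: at that point the spacing
# between adjacent floats exceeds 1, and the number of doublings equals the
# bits of precision.
ph = 0
h = np.float16(1)
while h != h + np.float16(1):
    h *= np.float16(2)
    ph += 1
eps16 = 2 ** (-(ph - 1))
print("bits of precision for 16-bit float", ph)
print("machine epsilon: 16-bit float", eps16)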
# %% single prec
ps = 0
s = np.float32(1)
while s != s + np.float32(1):
    s *= np.float32(2)
    ps += 1
eps32 = 2 ** (-(ps - 1))
print("bits of precision for 32-bit float", ps)
print("machine epsilon: 32-bit float", eps32)
# %% double prec
pd = 0
d = 1.0
while d != d + 1:
    d *= 2.0
    pd += 1
eps64 = 2 ** (-(pd - 1))
print("bits of precision for 64-bit float", pd)
print("machine epsilon: 64-bit float", eps64)
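# %% check the quoted definition for double precision
# A hedged sketch, not part of the original script: it tests the definition
# quoted in the docstring for 64-bit floats. Adding 2**-52 to one changes the
# result, while half of that (2**-53) is rounded away, assuming the default
# round-to-nearest-even mode. The name `eps_check` is illustrative only.
eps_check = 2.0 ** -52
print("1 + eps != 1:", 1.0 + eps_check != 1.0)
print("1 + eps/2 == 1:", 1.0 + eps_check / 2 == 1.0)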
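# %% quad prec
"""
caveats on Numpy "long double":
https://docs.scipy.org/doc/numpy-dev/user/basics.types.html#extended-precision
"""
# np.longdouble is used instead of np.float128 because np.float128 is not
# defined on every platform. The result depends on the platform's C long double
# and is often not true 128-bit quad precision.
pq = 0
q = np.longdouble(1.0)
while q != q + np.longdouble(1):
    q *= np.longdouble(2.0)
    pq += 1
eps128 = 2 ** (-(pq - 1))
print("bits of precision for long double", pq)
print("machine epsilon: long double", eps128)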
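# %% cross-check against numpy.finfo
# A hedged sketch, not part of the original script: NumPy reports the same
# quantities through np.finfo, so the loop results above can be compared with
# it. The helper name `finfo_summary` is an assumption made for illustration.
# finfo(...).nmant counts the stored mantissa bits, so for float16, float32
# and float64 (which have an implicit leading bit) the bits of precision
# are nmant + 1.
import numpy as np


def finfo_summary(dtype):
    """Return (bits of precision, machine epsilon) as reported by NumPy."""
    info = np.finfo(dtype)
    return info.nmant + 1, float(info.eps)


for name, dtype in (("16", np.float16), ("32", np.float32), ("64", np.float64)):
    bits, eps = finfo_summary(dtype)
    print("finfo: bits of precision for " + name + "-bit float", bits)
    print("finfo: machine epsilon: " + name + "-bit float", eps)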