32-bit Single-Precision Floating Point in Details - ByteScout (2024)

In modern days, programming languages tend to be as high-level as possible to make the programmer’s life a little bit easier. However, no matter how advanced programming language is, the code still has to be converted down to the machine code, via compilation, interpretation, or even virtual machine such as JVM. Of course, at this stage, rules are different: CPU works with addresses and registers without any classes, even «if» branches look like conditional jumps. One of the most important aspects of this execution is the arithmetic operation, and today we will be talking about one of these «cornerstones»: floating-point numbers and how they may affect your code.

A brief introduction to the history

The need for processing large or small values was present since the very first days of computers: even first designs of Charles Babbage’s Analytical Engine sometimes included floating-point arithmetic along with usual integer arithmetic. For a long time, the floating-point format was used primarily for scientific research, especially in physics, due to the large variety of data. It is extremely convenient that distance between Earth and Sun can be expressed in the same amount of bits as the distance between hydrogen and oxygen atoms in water molecules with the same relative precision and, even better, values of different magnitudes may be freely multiplied without large losses in precision.

Almost all the early implementations of floating-point numbers were software due to the complexity of the hardware implementations. Without this common standard, everybody had to come up with their own formats: this is how Microsoft Binary Format and IBM Floating Point Architecture were born; the latter is still used in some fields such as weather forecasting, although it is extremely rare by now.

Intel 8087 coprocessor, announced in 1980, also used its own format called «x87». It was the first coprocessor specifically dedicated to floating-point arithmetic with aims to replace slow library calls with the machine code. Then, based on x87 format, IEEE 754 was born as the first and successful attempt to create a universal standard for floating-point calculations. Soon, Intel started to integrate IEEE 754 into their CPUs, and nowadays almost every system except some embedded ones supports the floating-point format.

Theory and experiments

In IEEE 754 single-precision binary floating-point format, 32 bits are split into 1-bit sign flag, 8-bit exponent flag, and 23-bit fraction part, in that order (bit sign is the leftmost bit). This information should be enough for us to start some experiments! Let us see how number 1.0 looks like in this format using this simple C code:

union { float in; unsigned out;} converter; converter.in = float_number; unsigned bits = converter.out;

Of course, after getting the bits variable, we only need to print it. For instance, this way:

1.0 | 1 | S: 0 E: 01111111 M: 00000000000000000000000

Common sense tells that 1 can be expressed in binary fluting-point form as 1.0 * 2<sup>0</sup>, so exponent is 0 and significand is 1, while in IEEE 754 exponent is 1111111 (127 in decimal) and significand is 0.

The mystery behind exponent is simple: the exponent is actually shifted. A zero exponent is represented as 127; exponent of 1 is represented as 128 and so on. Maximum value of exponent should be 255 – 127 = 128, and minimum value should be 0 – 127 = -127. However, values 255 and 0 are reserved, so the actual range is -126…127. We will talk about those reserved values later.

The significand is even simpler to explain. Binary significand has one unique property: every significand in normalized form, except for zero, starts with 1 (this is only true for binary numbers). Next, if a number starts with zero, then it is not normalized. For instance, non-normalized 0.000101 * 10<sup>101</sup> is the same as normalized 1.01 * 10<sup>1</sup>. Due to that, there is no need to write an initial 1 for normalized numbers: we can just keep it in mind, saving space for one more significant bit. In our case, the actual significand is 1 and 23 zeroes, but because 1 is skipped, it is only 23 zeroes.

Let us try some different numbers in comparison with 1.

1.0 | 1 | S: 0 E: 01111111 M: 00000000000000000000000

-1.0 | -1 | S: 1 E: 01111111 M: 00000000000000000000000

2.0 | 2 | S: 0 E: 10000000 M: 00000000000000000000000

4.0 | 4 | S: 0 E: 10000001 M: 00000000000000000000000

1 / 8 | 0.125 | S: 0 E: 01111100 M: 00000000000000000000000

As we can see, a negative sign just inverts sign flag without touching the rest (this seems obvious, but it is not always the case in computer science: for integers, a negative sign is much more complex than just flipping one bit!). Changing the exponent by trying different powers of two works as expected.

1.0 | 1 | S: 0 E: 01111111 M: 00000000000000000000000

3.0 | 3 | S: 0 E: 10000000 M: 10000000000000000000000

5.0 | 5 | S: 0 E: 10000001 M: 01000000000000000000000

0.2 | 0.2 | S: 0 E: 01111100 M: 10011001100110011001101

It is easy to see that numbers 3 and 5 are represented as 1.1 and 1.01 with aproper exponent. 0.2 should not differ much from them, but it is. What happened?

It is easier to explain on decimals. 0.2 is the same number as 1/5. At the same time, not each radical can be represented as a decimal floating-point number: for example, 2/3 is 0.666666… It happens because 3 does not have any non-trivial common divisors with 10 (10 = 2 * 5, neither of them is 3). In the same time, 2/3 can be easily represented in base 12 as 0.8 (12 = 2 * 2 * 3). The same trick is true for a binary system: 2 does not have any common divisors with 5, so 0.2 can only be represented as infinitely long 0.00110011001100… At the same time, we only have 23 significant bits! So, we are inevitably losing precision.

Let us try with some multiplications.

1.0 | 1 | S: 0 E: 01111111 M: 00000000000000000000000

0.2^2*25 | 1 | S: 0 E: 01111111 M: 00000000000000000000001

25*0.2^2 | 1 | S: 0 E: 01111111 M: 00000000000000000000000

Both 1 and 0.2 * 0.2 * 25 are printed as 1, but they are actually different! Due to the precision loss, 0.2 * 0.2 * 25 is not the same as 1, and the expression (0.2f * 0.2f * 25.0f == 1.0f) is actually false. At the same time, if we execute 25 * 0.2 first, then the result is actually correct. It means that the rule (a * b) * c = a * (b * c) is not always true for floating-point numbers!

Special numbers

Remember about the fact that zero can never be written in the normalized form because it does not contain any 1s in its binary representation? Zero is a special number.

0 | 0 | S: 0 E: 00000000 M: 00000000000000000000000

-0 | -0 | S: 1 E: 00000000 M: 00000000000000000000000

For zero, IEEE 754 uses an exponent value of 0 and a significand value of 0. In addition, as you can see, there are actually two zero values: +0 and -0. In terms of comparison, (0.0f == -0.0f), is actually true, sign just does not count. +0 and -0 loosely correspond to the mathematical concept of the infinitesimal, positive and negative.

Are there any special numbers with an exponent value of 0? Yes. They are called «denormalized numbers». Those numbers can represent extremely small values, lesser than the minimum normalized number (which should be a little larger than 1 * 2<sup>-127</sup>). Examples:

2^-126 | 1.17549e-38 | S: 0 E: 00000001 M: 00000000000000000000000

2^-127 | 5.87747e-39 | S: 0 E: 00000000 M: 10000000000000000000000

2^-128 | 2.93874e-39 | S: 0 E: 00000000 M: 01000000000000000000000

2^-149 | 1.4013e-45 | S: 0 E: 00000000 M: 00000000000000000000001

2^-150 | 0 | S: 0 E: 00000000 M: 00000000000000000000000

A denormalized number has the virtual exponent value of 1, but, at the same time, they do not have omitted 1 as their first omitted digit. The only consequence is that denormalized numbers quickly lose precision: to store numbers between 2<sup>-128</sup> and 2<sup>-127</sup>, we are only using 21 digits of information instead of 23.

It is easy to see that zero is the special case of the denormalized number. Moreover, as we can see, the least possible single-precision floating-point number is actually 2<sup>-149</sup>, or approximately 1.4013 * 10<sup>-45</sup>.

Numbers with the exponent of 11111111 are reserved for the «other end» of the number scale: Infinity and the special value called «Not a Number».

1 / 0 | inf | S: 0 E: 11111111 M: 00000000000000000000000

1 / -0 | -inf | S: 1 E: 11111111 M: 00000000000000000000000

2^128 | inf | S: 0 E: 11111111 M: 00000000000000000000000

As with zeroes, infinity can be either positive or negative. It can be achieved by dividing any non-zero number to zero or by getting any number larger than maximum allowed (which is a little less than 2<sup>128</sup>). Infinity is processed as follows:

Infinity > Any number

Infinity = Infinity

Infinity > -Infinity

Any number / Infinity = 0 (sign is set properly)

Infinity * Infinity = Infinity (again, sign is set properly)

Infinity / Infinity = NaN

Infinity * 0 = NaN

Not a Number, or NaN, is, perhaps, the most interesting floating-point value. It can be obtained in multiple ways. First, it is the result of any indeterminate form:

Infinity * 0

0 / 0 or Infinity / Infinity

Infinity – Infinity or –Infinity + Infinity

Secondly, it can be the result of some non-trivial operations. Power function may return NaN in case of any of those indeterminate forms: 0<sup>0</sup>, 1<sup>Infinity</sup>, Infinity<sup>0</sup>. Any operation which can result in complex number may return NaN in this case: log(-1.0f), sqrt(-1.0f), sin(2.0f) are the examples.

Lastly, any operation involving NaN as any of the operands always returns NaN. Because of that, NaN can sometimes quickly “spread” through data like a computer virus. The only exception is min or max: those functions should return the non-NaN argument. NaN is never equal to any other number, even itself (it can be used to test numbers against NaN).

Actual contents of NaN are implementation-defined; IEEE 754 only requires that exponent should be 11111111, significand should be non-zero (zero is reserved for the infinity) and the sign does not matter.

0/0 | -nan | S: 1 E: 11111111 M: 10000000000000000000000

IEEE 754 differentiates two types of NaN: quiet NaN and signaling NaN. Their only difference is that signaling NaN generates interruption while quiet NaN does not. Again, the application decides if it generates quiet NaN or signaling NaN. For instance, the GCC C compiler always generates quiet NaN unless explicitly specified to behave the other way around.

Conclusions

What can we learn from all the facts and experiments above? In any language operating with the floating-point data type, beware of the following:

– You should almost never directly compare two floating-point numbers unless you know what you are doing! A better way to do it is to compare it with some precision.

if (a == b) – wrong!

if (fabsf(a – b) < epsilon) – correct!

Floating-point numbers lose precision even when you are just working with such seemingly harmless numbers as 0.2 or 71.3. You should be extra careful when working with a large amount of floating-point operations over the same data: errors may build up rather quickly. If you are getting unexpected results and you suspect rounding errors, try to use a different approach, and minimize errors.

– In the world of floating-point arithmetic, multiplication is not associative: a * (b * c) is not always equal to (a * b) * c.

Additional measures should be taken if you are working with either extremely large values, extremely small numbers, and/or numbers close to zero: in case of overflow or underflow those values will be transformed into +Infinty, -Infinity or 0. Numeric limits for single-precision floating-point numbers are approximately 1.175494e-38 to 3.402823e+38 (1.4013e-45 to 3.402823e+38 if we also count denormalized numbers)а.

Beware if your system generates «quiet NaN». Sometimes, it may help you to not crash the application. Sometimes, it may spoil program execution beyond recognition.

Nowadays, floating-point numbers operations are extremely fast, with speed comparable to the usual integer arithmetic: a number of floating-point operations per second, or FLOPS, is perhaps the most well-known measure of computer performance. The only downside is that the programmer should be aware of all the pitfalls regarding the precision and special floating-point values.

About the Author

32-bit Single-Precision Floating Point in Details - ByteScout (1)ByteScout Team of WritersByteScout has a team of professional writers proficient in different technical topics. We select the best writers to cover interesting and trending topics for our readers. We love developers and we hope our articles help you learn about programming and programmers.

32-bit Single-Precision Floating Point in Details - ByteScout (2024)

FAQs

What is 32-bit single precision floating point? ›

Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

How accurate is 32-bit floating point? ›

32-bit single precision, with an approximate absolute normalized range of 0 and 10 -38 to 10 38 and with a precision of about 7 decimal digits. 64-bit double precision, with an approximate absolute normalized range of 0 and 10 -308 to 10 308 and with a precision of about 16 decimal digits.

What is the resolution of 32-bit floating point? ›

They both use 32 bit IEEE-754 floating point numbers (single precision). This format has a 24 bit mantissa (if you count the hidden bit), so the effective resolution is between one part in 223 (eight million) and one part in 224 (16 million). This supports six or seven decimal digits of resolution.

How many digits are in a 32-bit floating point? ›

32 bit floating point has 24 significant binary digits. log10224≈7.22 ⁡ 2 24 ≈ 7.22 gives a pretty good estimate of the equivalent number of decimal digits. that answer would be ≈ 7.225 decimal digits.

Is 32-bit float worth it? ›

Bit Depth and Why It Matters

The real benefit of 32-bit float recording actually has to do with its lower noise floor, and how that can greatly benefit your workflow as a recording engineer, videographer, or any other type of audio professional.

How big is 32-bit float? ›

In other words, above 0 dBFS or below it. So, 32-bit float technically may go up to +770 dBFS above 0 dBFS and go down to -758 dBFS below 0 dBFS. This gives effectively 1528 dB dynamic range. Not that it's even technically possible to produce such levels.

What is the downside of 32-bit float? ›

The first thing to understand is that while the use of floating point numbers means that a greater range can be captured, it is not as precise as expressing that same information in discrete integer values. Therefore, if a massive range is not required, then 32-bit float may not be the best choice.

How precise is float 32? ›

float32 has 24 bits of precision, so it has room to spare. However, if they are independent measurements but there are some weighting factors involved, it will not approach this ideal. If they are not of the same quantity, such that there is cancellation involved, you can instead have less precision than the inputs.

How many values can a 32-bit floating point store? ›

Using 32 bits up to 4,294,967,296 pieces of unique data can be stored. As signed integers, the range is -2,147,483,648 to 2,147,483,647.

What is the bias of 32-bit floating point? ›

For example, 32-bit floats have an exponent bitdepth of 8, so they have a bias of 127 (27 - 1). The bias is applied to the absolute value of the exponent depending on the exponent's value. The actual interpretation of the number depends on the value of the exponent.

What does 32-bit resolution mean? ›

32-bits allows the software and graphics card hardware to make a window or area show things that are behind. Windows uses this a great deal with it's "glass" effects. Windows 11 and 10 have areas such as the task bar where the alpha channel is used to define which pixels are transparent and how transparent they are.

What is the minimum value of a 32-bit floating point number? ›

Maximum value of exponent should be 255 – 127 = 128, and minimum value should be 0 – 127 = -127.

What is 32-bit single precision? ›

XDR defines the single-precision floating-point data type as a float. The length of a float is 32 bits, or 4 bytes. Floats are encoded using the IEEE standard for normalized single-precision floating-point numbers. Sign of the number.

How many bytes is a float in a 32-bit system? ›

4 bytes

What is a single precision? ›

Single Precision: Single Precision is a format proposed by IEEE for the representation of floating-point numbers. It occupies 32 bits in computer memory. 2. Double Precision: Double Precision is also a format given by IEEE for the representation of the floating-point number. It occupies 64 bits in computer memory.

What does float mean in 32-bit float? ›

32-bit float provides a vast increase in the available space to record louder sound. In simple terms, fixed point audio records its data values as simple numbers, whereas floating point math records its data as scientific notation. An example of clipping of a digital waveform.

What is meant by single precision floating point? ›

Single Precision: Single Precision is a format proposed by IEEE for the representation of floating-point numbers. It occupies 32 bits in computer memory. 2. Double Precision: Double Precision is also a format given by IEEE for the representation of the floating-point number. It occupies 64 bits in computer memory.

What is the difference between 24-bit and 32-bit floating-point? ›

Files recorded with 32-bit float record sound where 0 dBFS of the 32-bit file lines up with 0 dBFS of the 24- or 16-bit file. Keep in mind that unlike the 24- or 16-bit files, the 32-bit file goes up to +770 dBFS. So compared to a 24-bit WAV file, the 32-bit float WAV file has 770 dB more headroom.

What is 32-bit per channel floating-point? ›

Because of the nature of floating point type variables, 32bit float is roughly equal to 23-24 bits of precision per channel (16777216 values per channel in the 0-1 range), but with a much wider range (it can go far above 1), necessary for HDR/Scene-referred values. It is also known as 'single floating point'.

References

Top Articles
Hay Tedder For Sale Craigslist
Abby's Caribbean Cafe
These Walls Have Eyes Walkthrough - The Casting of Frank Stone Guide - IGN
Far-right activist Laura Loomer's access to Trump reveals a crisis in his campaign
RS3 Mining Training Guide - 1-99/120 | Gaming Elephant
Smoke Terminal Waterbury Photos
Void Client Vrchat
Seattle Clipper Vacations Ferry Terminal Amtrak
Cmx Cinemas Gift Card Balance
Barbershops near me in Jupiter
3rd Gen Acura TL Buyers’ Guide – Everything You Need to Know – Newparts.com
Netlearning Login Rwjbh
Ups Store Fax Cost
Leicht Perlig Biography
Indicafans
Do people over 65 pay federal income tax?
How to track your Amazon order on your phone or desktop
Heather Alicia Sims
Rhiel Funeral Durand
My Time Banner Health
11 Shows Your Mom Loved That You Should Probably Revisit
What is IXL and How Does it Work?
Gas Buddy Prices Near Me Zip Code
Mynorthwoodtech
Clay County Tax Collector Auto Middleburg Photos
636-730-9503
Car Stunt Games Unblocked
Bayada Bucks Catalog 2023
Lox Club Gift Code
Https Eresponse Tarrantcounty Com
Broncos vs. Seahawks: How to Watch NFL Week 1 Online Today
Nikki Porsche Girl Head
Riverry Studio
Horoscope Today: Astrological prediction September 9, 2024 for all zodiac signs
Wlox Jail Docket
Fuzz Bugs Factory Hop Halloween
Directions To Truist Bank Near Me
Lockstraps Net Worth
Owen Roeder Tim Dillon
Walmart Careers Com Online Application
DePaul joins nationwide pro-Palestinian college protests as encampment continues at University of Chicago
Barbarian Frenzy Build with the Horde of the Ninety Savages set (Patch 2.7.7 / Season 32)
Jacksonville Jaguars should be happy they won't see the old Deshaun Watson | Gene Frenette
Venti X Zhongli R34
Kronos.nyp
La Monja 2 Pelicula Completa Tokyvideo
Black Panther Pitbull Puppy For Sale
Cambridge Assessor Database
Builders Best Do It Center
David Knowles, journalist who helped make the Telegraph podcast Ukraine: The Latest a runaway success
Hurst Scott Funeral Home Obituaries Richlands Virginia
866-360-2863
Latest Posts
Article information

Author: Arline Emard IV

Last Updated:

Views: 6486

Rating: 4.1 / 5 (52 voted)

Reviews: 91% of readers found this page helpful

Author information

Name: Arline Emard IV

Birthday: 1996-07-10

Address: 8912 Hintz Shore, West Louie, AZ 69363-0747

Phone: +13454700762376

Job: Administration Technician

Hobby: Paintball, Horseback riding, Cycling, Running, Macrame, Playing musical instruments, Soapmaking

Introduction: My name is Arline Emard IV, I am a cheerful, gorgeous, colorful, joyous, excited, super, inquisitive person who loves writing and wants to share my knowledge and understanding with you.