In the field of numeric computation, computers perform
mathematical operations on real-world data. However,
because computers represent numbers with a finite number
of digits, special techniques are needed to cope with the
consequences of this limitation: floating-point arithmetic,
the rounding errors it introduces, and iterative methods
for solving equations. This unit covers the basic concepts
of floating-point numbers, their operations, error analysis,
and some important iterative methods for solving equations
numerically.
1. Computer Arithmetic and Floating Point Numbers
Floating-point arithmetic is used to represent real numbers
in a way that can accommodate a wide range of values by
using a combination of a mantissa and an exponent. This
representation allows for both very small and very large
numbers, but it also introduces the possibility of errors due
to the finite precision of the system.
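Finite precision is easy to observe in practice. The following minimal sketch uses Python's built-in floats (IEEE 754 double precision) to show that some simple decimal values cannot be stored exactly:

```python
# 0.1, 0.2, and 0.3 have no exact binary representation,
# so the computed sum differs slightly from the decimal ideal.
a = 0.1 + 0.2
print(a)             # 0.30000000000000004
print(a == 0.3)      # False
print(abs(a - 0.3))  # tiny but nonzero discrepancy
```

The discrepancy is on the order of 10^-17, far below the precision of most measured data, but it accumulates across many operations, which is why error analysis matters.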
1.1 Floating-Point Representation
A floating-point number is generally represented as

    x = ± m × b^e

where m is the mantissa (also called the significand), b is the
base, and e is the exponent. For example, the number 123.456
can be represented in scientific notation as

    123.456 = 1.23456 × 10^2

In binary (base 2), this becomes approximately

    1.111011011101...₂ × 2^6
This is how floating-point numbers are represented in
computers, using a sign bit, exponent, and mantissa. A
floating-point number is limited by the number of bits
used to store the exponent and mantissa.
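The sign bit, exponent field, and mantissa field can be inspected directly. The sketch below, using Python's standard `struct` module, splits an IEEE 754 double (1 sign bit, 11 exponent bits, 52 mantissa bits) into its three components; the function name `decompose` is chosen here for illustration:

```python
import struct

def decompose(x: float):
    """Split an IEEE 754 double into sign, unbiased exponent, and mantissa bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    sign = bits >> 63                       # 1 sign bit
    exponent = (bits >> 52) & 0x7FF         # 11-bit biased exponent
    mantissa = bits & ((1 << 52) - 1)       # 52-bit fraction field
    return sign, exponent - 1023, mantissa  # remove the exponent bias

sign, exp, frac = decompose(123.456)
print(sign, exp)  # 0 6  ->  123.456 = +1.xxx... × 2^6
```

Since 2^6 = 64 ≤ 123.456 < 128 = 2^7, the unbiased exponent is 6, matching the scientific-notation example above.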
1.2 Floating-Point Operations
Operations on floating-point numbers, such as addition,
subtraction, multiplication, and division, are carried out on
the mantissas while the exponents are adjusted accordingly.
For addition and subtraction, the exponents must first be
aligned so that the mantissas refer to the same scale; for
multiplication, the mantissas are multiplied and the
exponents added; for division, the mantissas are divided and
the exponents subtracted. The result is then normalized and
rounded to fit the available precision.
However, because floating-point numbers have limited
precision, errors may arise during these operations due to
rounding. For instance, small discrepancies can occur
when a result is truncated or rounded to fit within the
available precision of the computer.