Introduction - University of Arizona

LING 408/508: Programming for Linguists
Lecture 2
August 26th
Today’s Topics
• continuing on from last time …
• Homework 1
Administrivia
• No class on
– Monday September 7th (Labor Day)
– Wednesday November 11th (Veterans Day)
– Week after September 11th (out of town), plus Monday 21st
– Monday October 12th
Introduction: data types
• what if you want to store even larger numbers
than 32 bits?
– Binary Coded Decimal (BCD)
– 1 byte can code two digits (0-9 requires 4 bits)
– 1 nibble (4 bits) codes the sign (+/-), e.g. hex C/D
– each nibble has bit weights 2^3 2^2 2^1 2^0, e.g. 0000 = 0, 0001 = 1, 1001 = 9
– example: 2014 takes 2 bytes (= 4 nibbles); +2014 takes 2.5 bytes (= 5 nibbles)
– sign nibble: credit (+) is 1100 = hex C, debit (-) is 1101 = hex D
Introduction: data types
• e.g. probabilities
• Typically, 64 bits (8 bytes) are used to represent
floating point numbers (double precision)
– e.g. c = 2.99792458 x 10^8 (m/s)
– coefficient: 52 bits (implied 1, therefore treat as 53)
– exponent: 11 bits (usually not 2’s complement,
unsigned with bias 2^(11-1)-1 = 1023)
– sign: 1 bit (+/-)
• C: float, double
• x86 CPUs have a built-in floating point coprocessor (x87)
with 80 bit long registers
(Wikipedia)
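As a minimal sketch of the 64 bit layout described above (1 bit sign, 11 bits exponent, 52 bits coefficient), the following Python pulls the three fields out of a double. The helper name `double_fields` is made up for this illustration, and the exponent bias of 1023 is the standard IEEE 754 double precision value.

```python
import struct

def double_fields(x):
    """Split a 64 bit double into (sign, biased exponent, fraction) integers."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # reinterpret the 8 bytes as an integer
    sign = bits >> 63                    # top bit
    exponent = (bits >> 52) & 0x7FF      # next 11 bits, stored with bias 1023
    fraction = bits & ((1 << 52) - 1)    # low 52 bits, implied leading 1 not stored
    return sign, exponent, fraction

sign, exponent, fraction = double_fields(299792458.0)
print(sign, exponent - 1023)             # sign bit and unbiased exponent

# Rebuild the value from the fields: (1 + fraction / 2^52) x 2^(exponent - bias)
value = (1 + fraction / 2**52) * 2 ** (exponent - 1023)
print(value == 299792458.0)
```

The rebuilt value compares equal to the original, which is one way to check that the fields were read correctly.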
Introduction: data types
• Next time, we'll talk about the representation
of characters (letters, symbols, etc.)
Example 1
• Recall the speed of light:
• c = 2.99792458 x 10^8 (m/s)
1. Can a 4 byte integer be used to represent c
exactly?
– 4 bytes = 32 bits
– 32 bits in 2’s complement format
– Largest positive number is 2^31-1 = 2,147,483,647
– c = 299,792,458, which is smaller, so yes
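The comparison in Example 1 can be checked with a few lines of Python. This is only a sketch: the `"<i"` format code of the standard `struct` module packs a signed 4 byte integer, which stands in here for a 32 bit 2's complement value.

```python
import struct

c = 299_792_458          # speed of light in m/s, as an integer
INT32_MAX = 2**31 - 1    # largest positive 32 bit 2's complement number

print(INT32_MAX)         # 2147483647
print(c <= INT32_MAX)    # True

# Packing succeeds and round-trips exactly, so 4 bytes are enough.
packed = struct.pack("<i", c)
print(len(packed), struct.unpack("<i", packed)[0] == c)
```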
Example 2
• Recall the speed of light:
• c = 2.99792458 x 10^8 (m/s)
2. How much memory would you need to
encode c using BCD notation?
– 9 digits
– each digit requires 4 bits (a nibble)
– BCD notation includes a sign nibble
– total is 10 nibbles = 5 bytes
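A small sketch of BCD encoding in Python, to make the nibble counting concrete. The function name `to_packed_bcd` is hypothetical, and two layout choices are assumptions rather than something the slides specify: the sign nibble is placed last, and a leading zero nibble is added when needed to fill a whole byte.

```python
def to_packed_bcd(n):
    """Encode integer n as one nibble per decimal digit plus a sign nibble (C = +, D = -)."""
    digits = [int(d) for d in str(abs(n))]
    nibbles = digits + [0xC if n >= 0 else 0xD]   # assumption: sign nibble goes last
    if len(nibbles) % 2:                          # assumption: pad to a whole number of bytes
        nibbles.insert(0, 0)
    # Combine each pair of nibbles into one byte.
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

encoded = to_packed_bcd(299_792_458)
print(encoded.hex().upper(), len(encoded))        # 299792458C 5
```

Because every nibble holds one decimal digit, the hex dump of the result reads like the decimal number followed by the sign code.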
Example 3
• Recall the speed of light:
• c = 2.99792458 x 10^8 (m/s)
3. Can the 64 bit floating point representation
(double) encode c without loss of precision?
– Recall significand precision: 53 bits (52 explicitly
stored)
– 2^53-1 = 9,007,199,254,740,991
– almost 16 digits, and c has only 9, so yes
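Python's built-in `float` is a 64 bit double on the usual platforms, so the 53 bit limit from Example 3 can be observed directly. A short sketch:

```python
c = 299_792_458
limit = 2**53 - 1                          # largest integer with a 53 bit significand

print(float(c) == c)                       # True: c survives conversion to double
print(int(float(limit)) == limit)          # True: still exact at the limit
print(float(2**53 + 1) == float(2**53))    # True: one bit too many, so it gets rounded
```

The last line shows what loss of precision looks like: two different integers end up as the same double.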
Example 4
• Recall the speed of light:
• c = 2.99792458 x 10^8 (m/s)
• The 32 bit floating point representation (float),
sometimes called single precision, is composed
of 1 bit sign, 8 bits exponent (unsigned with bias
2^(8-1)-1 = 127), and 23 bits coefficient (24 bits effective).
• Can it represent c without loss of precision?
– 2^24-1 = 16,777,215
– Nope: c = 299,792,458 needs more than 24 significant bits
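To see what single precision does to c, one option is to round-trip the value through the 4 byte `"f"` format of Python's `struct` module. The helper name `to_float32` is made up for this sketch.

```python
import struct

def to_float32(x):
    """Round x to the nearest 32 bit float and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

c = 299_792_458
stored = to_float32(c)
print(stored)                                   # 299792448.0
print(stored == c)                              # False: 10 m/s were lost
print(to_float32(16_777_215) == 16_777_215)     # True: 2^24-1 still fits
```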
Homework 1
• For both solutions, show your work, i.e. how
you derived your answer
• Pi (𝛑) is an irrational number
– can't be represented precisely!
wikipedia
Homework 1
1. Encode Pi as accurately as possible using both
the 64 and 32 bit floating point representations
Instruction: draw the diagram and fill in the 1's and 0's
2. How many decimal places of precision is
provided by each of the 64 and 32 bit floating
point representations?
Homework 1 Hints
• How to encode 1: (exp: 01111 = bias 01111 + 0 = 2^0, frac:
1000…; remember: there is an implicit leading 1)
= 1.000… in binary
Homework 1 Hints
• How to encode 2: (exp: 10000 = bias 01111 +
1 = 2^1, frac: 1000…) = 10.00… in binary
Homework 1 Hints
• How to encode 3: (exp: 10000 = bias 01111 +
1 = 2^1, frac: 1100…) = 11.000… in binary
Homework 1 Hints
• How to encode 4: (exp: 10001 = bias 01111 +
10 = 2^2, frac: 1000…) = 100.0… in binary
Homework 1 Hints
• How to encode 5: (exp: 10001 = bias 01111 +
10 = 2^2, frac: 1010…) = 101.0… in binary
Homework 1 Hints
• How to encode 6: (exp: 10001 = bias 01111 +
10 = 2^2, frac: 1100…) = 110.0… in binary
Homework 1 Hints
• How to encode 7: (exp: 10001 = bias 01111 +
10 = 2^2, frac: 1110…) = 111.0… in binary
Homework 1 Hints
• How to encode 8: (exp: 10010 = bias 01111 +
11 = 2^3, frac: 1000…) = 1000.0… in binary
Homework 1 Hints
• Decimal 3.5 is 1.11 x 2^1 = 11.1 in binary
Homework 1 Hints
• Decimal 3.25 is 1.101 x 2^1 = 11.01 in binary
Homework 1 Hints
• Decimal 3.125 is 1.1001 x 2^1 = 11.001 in binary
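The hints can be checked against a real 32 bit float with a few lines of Python. This is a sketch, with a made-up helper name `float32_fields`. Two things differ from the shorthand in the hints: the exponent field of an actual 32 bit float is 8 bits wide, so the bias appears as 01111111 rather than 01111, and the stored fraction leaves out the implicit leading 1 (the hints appear to write it in).

```python
import struct

def float32_fields(x):
    """Return the (sign, exponent, fraction) bit strings of x stored as a 32 bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = format(bits, "032b")          # all 32 bits as a string of 0s and 1s
    return s[0], s[1:9], s[9:]        # 1 + 8 + 23 bits

for x in (1.0, 2.0, 3.0, 3.5, 8.0):
    sign, exponent, fraction = float32_fields(x)
    print(x, sign, exponent, fraction)
# 1.0 0 01111111 00000000000000000000000
# 3.5 0 10000000 11000000000000000000000   (1.11 x 2^1, leading 1 not stored)
```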
Homework 1
• Due Friday night
– (by midnight, in my email inbox)
• Required format (for all homeworks unless
otherwise specified):
– Plain text or PDF formats only
• (no .doc, .docx etc.)
– Single file only – cut and paste into one document
• (no multiple attachments)
– Subject line: 408/508 Homework 1
– First line: your full name
Introduction: data types
• How about letters, punctuation, etc.?
• ASCII
– American Standard Code for Information Interchange
– Based on English alphabet (upper and lower case) + space + digits +
punctuation + control (Teletype Model 33)
– Question: how many bits do we need?
– 7 bits + 1 bit parity
– Remember everything is in binary …
• C: char
(Figure: Teletype Model 33 ASR Teleprinter, Wikipedia)
Introduction: data types
order is important in sorting!
0-9: there’s a connection with BCD. Notice: code 30 (hex) through 39 (hex)
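Both remarks on this slide are easy to try out in Python, where `ord` gives the code of a character. A short sketch: the low nibble of each digit's ASCII code is the same bit pattern BCD uses for that digit, and sorting strings follows the order of the codes.

```python
for ch in "0123456789":
    code = ord(ch)
    # hex code, then the low 4 bits (the BCD nibble for that digit)
    print(ch, format(code, "02X"), format(code & 0x0F, "04b"))

# Sorting compares character codes: digits < upper case < lower case.
print(sorted(["b", "A", "a", "1"]))   # ['1', 'A', 'a', 'b']
```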
Introduction: data types
• Parity bit:
– transmission can be noisy
– parity bit can be added to ASCII code
– can spot single bit transmission errors
– even/odd parity:
• receiver understands each byte should be even/odd
– Example:
• 0 (zero) is ASCII 30 (hex) = 0110000 (7 bits)
• even parity: 01100000, odd parity: 01100001 (parity bit added at the end)
– Checking parity:
• Exclusive or (XOR): basic machine instruction
– A xor B true if either A or B true but not both
– Example:
• (even parity 0) 01100000 xor bit by bit
• 0 xor 1 = 1 xor 1 = 0 xor 0 = 0 xor 0 = 0 xor 0 = 0 xor 0 = 0 xor 0 = 0
• x86 assembly language:
1. PF: even parity flag set by arithmetic ops.
2. TEST: AND (don’t store result), sets PF
3. JP: jump if PF set
– Example:
MOV al,<char>
TEST al, al
JP <location if even>
<go here if odd>
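The XOR check can be written out in Python as a rough sketch. The function names are made up for this illustration, and putting the parity bit after the 7 bit code (as the last bit) is an assumption taken from the slide's example, not the only convention in use.

```python
def parity(byte):
    """XOR all the bits together: 0 means an even number of 1 bits, 1 means odd."""
    result = 0
    while byte:
        result ^= byte & 1
        byte >>= 1
    return result

def add_even_parity(code7):
    """Append a parity bit to a 7 bit code so that the 8 bit total has even parity."""
    return (code7 << 1) | parity(code7)

zero = ord("0")                           # 30 (hex) = 0110000
sent = add_even_parity(zero)
print(format(sent, "08b"), parity(sent))  # 01100000 0

corrupted = sent ^ 0b00000100             # flip one bit, as a noisy line might
print(parity(corrupted))                  # 1: the receiver can tell something went wrong
```

Flipping two bits would bring the parity back to 0, which is why the slide says only single bit errors can be spotted.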
Introduction: data types
• UTF-8
– standard in the post-ASCII world
– backwards compatible with ASCII
– (previously, different languages had multi-byte character sets that
clashed)
– Universal Character Set (UCS) Transformation Format 8-bits
(Wikipedia)
Introduction: data types
• Example:
– あ Hiragana letter A: UTF-8: E38182
– Byte 1: E = 1110, 3 = 0011
– Byte 2: 8 = 1000, 1 = 0001
– Byte 3: 8 = 1000, 2 = 0010
– い Hiragana letter I: UTF-8: E38184
• Shift-JIS (Hex):
– あ: 82A0
– い: 82A2
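The byte values in this example can be reproduced with Python's `str.encode`, using the standard codec names `"utf-8"` and `"shift_jis"`. A minimal sketch:

```python
for ch in "あい":
    utf8 = ch.encode("utf-8")
    sjis = ch.encode("shift_jis")
    bits = " ".join(format(b, "08b") for b in utf8)   # each UTF-8 byte in binary
    print(ch, utf8.hex().upper(), bits, sjis.hex().upper())
# あ E38182 11100011 10000001 10000010 82A0
# い E38184 11100011 10000001 10000100 82A2
```

The same character comes out as three bytes in one encoding and two in the other, so a program has to know which encoding a file uses before it can read the text.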
Introduction: data types
• How can you tell what encoding your file is using?
• Detecting UTF-8
– Microsoft:
• the first three bytes of the file are EF BB BF (the byte order mark)
• (not all software understands this; not everybody uses it)
– HTML:
• <meta http-equiv="Content-Type"
content="text/html;charset=UTF-8" >
• (not always present)
– Analyze the file:
• Find non-valid UTF-8 sequences: if found, not UTF-8…
• Interesting paper:
– http://wwwarchive.mozilla.org/projects/intl/UniversalCharsetDetection.html
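Two of the checks above (looking for the EF BB BF marker, and looking for non-valid UTF-8 sequences) can be sketched in Python. The function name `looks_like_utf8` is made up for this example, and it is nowhere near a full charset detector like the one described in the paper.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def looks_like_utf8(data):
    """Return (has_bom, decodes_ok) for a bytes object read from a file."""
    has_bom = data.startswith(UTF8_BOM)
    try:
        data.decode("utf-8")          # raises on any non-valid UTF-8 sequence
        decodes_ok = True
    except UnicodeDecodeError:
        decodes_ok = False
    return has_bom, decodes_ok

print(looks_like_utf8(UTF8_BOM + "あ".encode("utf-8")))   # (True, True)
print(looks_like_utf8("あ".encode("utf-8")))              # (False, True)
print(looks_like_utf8("あ".encode("shift_jis")))          # (False, False)
```

A successful decode is only evidence, not proof: plain ASCII text also decodes as UTF-8 without complaint.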
Introduction: data types
• Filesystem:
– different on different computers: sometimes a problem if
you mount filesystems across different systems
• Examples:
– FAT32 (File Allocation Table): DOS, Windows, memory cards; limited to 4GB max file size
– ExFAT (Extended FAT): SD cards (> 4GB files)
– NTFS (New Technology File System): Windows
– ext4 (Fourth Extended Filesystem): Linux
– HFS+ (Hierarchical File System Plus): Macs
Introduction: data types
• Filesystem:
– different on different computers: sometimes a problem if you mount
filesystems across different systems
• Files:
– Name (Path from / root)
– Type (e.g. .docx, .pptx, .pdf, .html, .txt)
– Owner (usually the Creator)
– Permissions (for the Owner, Group, or Everyone)
– need to be opened (to read from or write to)
– Mode: read/write/append
– Binary/Text
• in all programming languages: open command
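As an illustration of opening a file in the different modes, here is a short Python sketch using the built-in `open`. The file name `example.txt` and the temporary directory are placeholders chosen for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")   # throwaway location

with open(path, "w", encoding="utf-8") as f:   # write mode (creates or overwrites)
    f.write("first line\n")
with open(path, "a", encoding="utf-8") as f:   # append mode (adds to the end)
    f.write("second line\n")
with open(path, "r", encoding="utf-8") as f:   # read mode, as text
    text = f.read()
with open(path, "rb") as f:                    # read mode, as binary (raw bytes)
    raw = f.read()

print(text.splitlines())   # ['first line', 'second line']
print(raw[:5])             # b'first'
```

Text mode hands back strings and takes care of the encoding; binary mode hands back the bytes exactly as they are stored.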
Introduction: data types
• Text files:
– text files have lines: how do we mark the end of a line?
– End of line (EOL) control character(s):
• LF 0x0A (Mac/Linux),
• CR 0x0D (Old Macs),
• CR+LF 0x0D0A (Windows)
– End of file (EOF) control character:
• (EOT) 0x04 (aka Control-D)
• programming languages: NUL used to mark the end of a string
binaryvision.nl
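The three end of line conventions can be told apart by looking at the raw bytes. A small Python sketch, with a made-up helper name `detect_eol` and made-up sample data; `bytes.splitlines` is a standard method that accepts all three conventions.

```python
samples = {
    "Mac/Linux": b"one\ntwo\n",
    "Old Macs": b"one\rtwo\r",
    "Windows": b"one\r\ntwo\r\n",
}

def detect_eol(data):
    """Guess the end of line convention from raw bytes (checks CR+LF first)."""
    if b"\r\n" in data:
        return "CR+LF"
    if b"\r" in data:
        return "CR"
    if b"\n" in data:
        return "LF"
    return None

for system, data in samples.items():
    # splitlines() handles LF, CR and CR+LF, so the lines come out the same
    print(system, detect_eol(data), data.splitlines())
```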