AllExperts > Encyclopedia 
Search      
Find out about volunteering to AllExperts

Data compression: Encyclopedia BETA


Free Encyclopedia
 Home · Index · Browse A-Z  · Questions and Answers ·
Encyclopedia

Browse A-Z
ABCDEFGHIJKLMNOPQRSTUVWXYZNum


License
Disclaimer

 
 
 
 
Free Online Courses
12 Weeks to Weight Loss
Take Charge of Stress
Learn How to Bake
Budgeting 101
Deeper Faith
DIY Fashion Makeover

       MORE E-COURSES
 
   

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z  Misc

Data compression

In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. For example, this article could be encoded with fewer bits if we accept the convention that the word "compression" be encoded as "comp". One popular instance of compression that many computer users are familiar with is the ZIP file format, which, as well as providing compression, acts as an archiver, storing many files in a single output file.

As is the case with any form of communication, compressed data communication only works when both the sender and receiver of the information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver. Some compression algorithms exploit this property in order to encrypt data during the compression process so that decompression can only be achieved by an authorized party (eg. through the use of a password).

Compression is possible because most real-world data has statistical redundancy. For example, the letter 'e' is much more common in English text than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is rather small. Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely, but nevertheless perfectly.

Further compression is possible if some loss of fidelity is allowable. For example, a person viewing a picture or television video scene might not notice if some of its finest details are removed or not represented perfectly. Similarly, two strings of samples representing an audio recording may sound the same but actually not be exactly the same. Lossy compression algorithms introduce relatively minor differences and represent the picture, video, or audio using fewer bits.

Compression is important because it helps reduce the consumption of expensive resources, such as disk space or connection bandwidth. However, compression requires information processing power, which can also be expensive. The design of data compression schemes therefore involves trade-offs between various factors including compression capability, any amount of introduced distortion, computational resource requirements, and often other considerations as well.

Some schemes are reversible so that the original data can be reconstructed (lossless data compression), while others accept some loss of data in order to achieve higher compression (lossy data compression).

However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually result in an expansion, as will attempts to compress encrypted data.

In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, which for example always removes the last byte of a file, will always compress a file up to the point where it is empty.

Applications

One very simple means of compression is run-length encoding, wherein large runs of consecutive identical data values are replaced by a simple code with the data value and length of the run. This is an example of lossless data compression. It is often used to better use disk space on office computers, or better use the connection bandwidth in a computer network. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).

For visual and audio data, some loss of quality can be tolerated without losing the essential nature of the data. By taking advantage of limitations of the human sensory system, a great deal of space can be saved while producing output which is nearly indistinguishable from the original. These lossy data compression methods typically offer a three-way tradeoff between compression speed, compressed data size and quality loss.

Lossy image compression is used in digital cameras, greatly increasing their storage capacities while hardly degrading picture quality at all. Similarly, DVDs use the lossy MPEG-2 codec for video compression.

In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline than "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony for example, while audio compression is used for CD ripping and is decoded by MP3 players.

Theory

The theoretical background of compression is provided by information theory (which is closely related to algorithmic information theory) and by rate-distortion theory. These fields of study were essentially created by Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Doyle and Carlson (2000) wrote that data compression "has one of the simplest and most elegant design theories in all of engineering". Cryptography and coding theory are also closely related. The idea of data compression is deeply connected with statistical inference.

Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy data compression systems typically include even more stages, including, for example, prediction, frequency transformation, and quantization.

The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, although compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) was patented by Unisys until June of 2003, and is used in GIF images. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX).A current LZ based coding scheme that performs well is LZX, used in Microsoft's CAB format.

The very best compressors use probabilistic models whose predictions are coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm, and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image-compression standard JBIG, and the document-compression standard DjVu. The text entry system, Dasher, is an inverse-arithmetic-coder.

See also

Data compression topics

* algorithmic complexity theory
* information entropy
* self-extraction
* image compression
* speech compression
* video compression
* multimedia compression
* minimum description length
* minimum message length (two-part lossless compression designed for inference)
* List of archive formats
* List of file archivers
* Comparison of file archivers
* List of Unix programs
* Free file format

Compression algorithms

Lossless data compression

* run-length encoding
* dictionary coders
** LZ77 & LZ78
** LZW
* Burrows-Wheeler transform
* prediction by partial matching (also known as PPM)
* context mixing
* entropy encoding
** Huffman coding (simple entropy coding; commonly used as the final stage of compression)
** Adaptive Huffman coding
** arithmetic coding (more advanced)
***Shannon-Fano_coding
***range encoding (same as arithmetic coding, but looked at in a slightly different way)
** T-code, A variant of Huffman code
** Golomb coding (simple entropy coding for infinite input data with a geometric distribution)
** universal codes (entropy coding for infinite input data with an arbitrary distribution)
*** Elias gamma coding
*** Fibonacci coding

Lossy data compression

* discrete cosine transform
* fractal compression
** fractal transform
* wavelet compression
* vector quantization
* linear predictive coding
* Distributed Source Coding Using Syndromes, for correlated data
* Modulo-N code for correlated data
* A-law Compander
* Mu-law Compander

Example Implementations

* DEFLATE (a combination of LZ77 and Huffman coding) – used by ZIP, gzip and PNG files
* LZMA used by 7-Zip and StuffitX
* LZO (very fast LZ variation, speed oriented)
* LZX (an LZ77 family compression algorithm)
* Unix compress utility (the .Z file format), and GIF use LZW
* bzip2 (a combination of the Burrows-Wheeler transform and Huffman coding)
* PAQ (very high compression based on context mixing, but extremely slow; competing in the top of the highest compression competitions)
* JPEG (image compression using a discrete cosine transform, then quantization, then Huffman coding)
* MPEG (audio and video compression standards family in wide use, using DCT and motion-compensated prediction for video)
** MP3 (a part of the MPEG-1 standard for sound and music compression, using subbanding and MDCT, perceptual modeling, quantization, and Huffman coding)
** AAC (part of the MPEG-2 and MPEG-4 audio coding specifications, using MDCT, perceptual modeling, quantization, and Huffman coding)
* Vorbis (DCT based AAC-alike audio codec, designed with a focus on avoiding patent encumbrance)
* JPEG 2000 (image compression using wavelets, then quantization, then entropy coding)
* TTA (uses linear predictive coding for lossless audio compression)
* FLAC (linear predictive coding for lossless audio compression)

External links

*Data Compression Benchmarks and Tests
*Data Compression - Systematisation by T.Strutz
*Public domain article on data compression
*How Compression Works
*Ultimate Command Line Compressors
*Compression Resources catalog (currently the biggest)
*The Data Compression News Blog
*Practical Compressor Test (Compares speed and efficiency for commonly used compression programs)
*The Monthly Data Compression Newsletter
*Compressed File Types and File Extensions



  Rate this Article
   Was this article helpful?
Not at allDefinitely              
   12345  

Email this page
About Us | Advertise on This Site | User Agreement | Privacy Policy | Kids' Privacy Policy | Help
About and About.com are registered trademarks of About, Inc. The About logo is a trademark of About, Inc. All rights reserved.
This is the "GNU Free Documentation License" reference article from the English Wikipedia. All text is available under the terms of the GNU Free Documentation License. See also our Disclaimer.