- Title
- Evaluating text preprocessing to improve compression on maillogs
- Creator
- Otten, Fred, Irwin, Barry V W, Thinyane, Hannah
- Subject
- To be catalogued
- Date
- 2009
- Type
- text
- Type
- article
- Identifier
- http://hdl.handle.net/10962/430138
- Identifier
- vital:72668
- Identifier
- https://doi.org/10.1145/1632149.1632157
- Description
- Maillogs contain important information about mail which has been sent or received. This information can be used for statistical purposes, to help prevent viruses or to help prevent SPAM. In order to satisfy regulations and follow good security practices, maillogs need to be monitored and archived. Since there is a large quantity of data, some form of data reduction is necessary. Data compression programs such as gzip and bzip2 are commonly used to reduce the quantity of data. Text preprocessing can be used to aid the compression of English text files. This paper evaluates whether text preprocessing, particularly word replacement, can be used to improve the compression of maillogs. It presents an algorithm for constructing a dictionary for word replacement and provides the results of experiments conducted using the ppmd, gzip, bzip2 and 7zip programs. These tests show that text preprocessing improves data compression on maillogs. Improvements of up to 56 percent in compression time and up to 32 percent in compression ratio are achieved. It also shows that a dictionary may be generated and used on other maillogs to yield reductions within half a percent of the results achieved for the maillog used to generate the dictionary.
- Format
- 9 pages, pdf
- Language
- English
- Relation
- Proceedings of the 2009 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists, Otten, F., Irwin, B. and Thinyane, H., 2009, October. Evaluating text preprocessing to improve compression on maillogs. In Proceedings of the 2009 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists (pp. 44-53), volume 2009, number 1, pages 44-53, 2009, ISBN 978-1-60558-643-4
- Rights
- Publisher
- Rights
- Use of this resource is governed by the terms and conditions of the ACM Digital Library Statement (https://libraries.acm.org/digital-library/policies#anchor3)
File | Description | Size | Format
---|---|---|---
SOURCE1 | Evaluating text preprocessing to improve compression on maillogs.pdf | 693 KB | Adobe Acrobat PDF