Using semantic knowledge to improve compression on log files
- Authors: Otten, Frederick John
- Date: 2009 , 2008-11-19
- Subjects: Computer networks , Data compression (Computer science) , Semantics--Data processing
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4650 , http://hdl.handle.net/10962/d1006619 , Computer networks , Data compression (Computer science) , Semantics--Data processing
- Description: With the move towards global and multi-national companies, information technology infrastructure requirements are increasing. As the size of these computer networks increases, it becomes more and more difficult to monitor, control, and secure them. Networks consist of a number of diverse devices, sensors, and gateways which are often spread over large geographical areas. Each of these devices produce log files which need to be analysed and monitored to provide network security and satisfy regulations. Data compression programs such as gzip and bzip2 are commonly used to reduce the quantity of data for archival purposes after the log files have been rotated. However, there are many other compression programs which exist - each with their own advantages and disadvantages. These programs each use a different amount of memory and take different compression and decompression times to achieve different compression ratios. System log files also contain redundancy which is not necessarily exploited by standard compression programs. Log messages usually use a similar format with a defined syntax. In the log files, all the ASCII characters are not used and the messages contain certain "phrases" which often repeated. This thesis investigates the use of compression as a means of data reduction and how the use of semantic knowledge can improve data compression (also applying results to different scenarios that can occur in a distributed computing environment). It presents the results of a series of tests performed on different log files. It also examines the semantic knowledge which exists in maillog files and how it can be exploited to improve the compression results. The results from a series of text preprocessors which exploit this knowledge are presented and evaluated. These preprocessors include: one which replaces the timestamps and IP addresses with their binary equivalents and one which replaces words from a dictionary with unused ASCII characters. In this thesis, data compression is shown to be an effective method of data reduction producing up to 98 percent reduction in filesize on a corpus of log files. The use of preprocessors which exploit semantic knowledge results in up to 56 percent improvement in overall compression time and up to 32 percent reduction in compressed size. , TeX , pdfTeX-1.40.3
- Full Text:
- Date Issued: 2009
- Authors: Otten, Frederick John
- Date: 2009 , 2008-11-19
- Subjects: Computer networks , Data compression (Computer science) , Semantics--Data processing
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4650 , http://hdl.handle.net/10962/d1006619 , Computer networks , Data compression (Computer science) , Semantics--Data processing
- Description: With the move towards global and multi-national companies, information technology infrastructure requirements are increasing. As the size of these computer networks increases, it becomes more and more difficult to monitor, control, and secure them. Networks consist of a number of diverse devices, sensors, and gateways which are often spread over large geographical areas. Each of these devices produce log files which need to be analysed and monitored to provide network security and satisfy regulations. Data compression programs such as gzip and bzip2 are commonly used to reduce the quantity of data for archival purposes after the log files have been rotated. However, there are many other compression programs which exist - each with their own advantages and disadvantages. These programs each use a different amount of memory and take different compression and decompression times to achieve different compression ratios. System log files also contain redundancy which is not necessarily exploited by standard compression programs. Log messages usually use a similar format with a defined syntax. In the log files, all the ASCII characters are not used and the messages contain certain "phrases" which often repeated. This thesis investigates the use of compression as a means of data reduction and how the use of semantic knowledge can improve data compression (also applying results to different scenarios that can occur in a distributed computing environment). It presents the results of a series of tests performed on different log files. It also examines the semantic knowledge which exists in maillog files and how it can be exploited to improve the compression results. The results from a series of text preprocessors which exploit this knowledge are presented and evaluated. These preprocessors include: one which replaces the timestamps and IP addresses with their binary equivalents and one which replaces words from a dictionary with unused ASCII characters. In this thesis, data compression is shown to be an effective method of data reduction producing up to 98 percent reduction in filesize on a corpus of log files. The use of preprocessors which exploit semantic knowledge results in up to 56 percent improvement in overall compression time and up to 32 percent reduction in compressed size. , TeX , pdfTeX-1.40.3
- Full Text:
- Date Issued: 2009
Pseudo-random access compressed archive for security log data
- Authors: Radley, Johannes Jurgens
- Date: 2015
- Subjects: Computer security , Information storage and retrieval systems , Data compression (Computer science)
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4723 , http://hdl.handle.net/10962/d1020019
- Description: We are surrounded by an increasing number of devices and applications that produce a huge quantity of machine generated data. Almost all the machine data contains some element of security information that can be used to discover, monitor and investigate security events.The work proposes a pseudo-random access compressed storage method for log data to be used with an information retrieval system that in turn provides the ability to search and correlate log data and the corresponding events. We explain the method for converting log files into distinct events and storing the events in a compressed file. This yields an entry identifier for each log entry that provides a pointer that can be used by indexing methods. The research also evaluates the compression performance penalties encountered by using this storage system, including decreased compression ratio, as well as increased compression and decompression times.
- Full Text:
- Date Issued: 2015
- Authors: Radley, Johannes Jurgens
- Date: 2015
- Subjects: Computer security , Information storage and retrieval systems , Data compression (Computer science)
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4723 , http://hdl.handle.net/10962/d1020019
- Description: We are surrounded by an increasing number of devices and applications that produce a huge quantity of machine generated data. Almost all the machine data contains some element of security information that can be used to discover, monitor and investigate security events.The work proposes a pseudo-random access compressed storage method for log data to be used with an information retrieval system that in turn provides the ability to search and correlate log data and the corresponding events. We explain the method for converting log files into distinct events and storing the events in a compressed file. This yields an entry identifier for each log entry that provides a pointer that can be used by indexing methods. The research also evaluates the compression performance penalties encountered by using this storage system, including decreased compression ratio, as well as increased compression and decompression times.
- Full Text:
- Date Issued: 2015
- «
- ‹
- 1
- ›
- »