A honeypot is a cybersecurity technique in which a bait (‘honey’) system or network is designed to emulate a real system or network in order to divert malicious attacks away from the real one. The honeypot may mitigate, block, and in some cases capture the malicious behavior. The concept probably originated from two works, “The Cuckoo's Egg” by Clifford Stoll and “An Evening with Berferd” by Bill Cheswick, both describing the authors' own efforts to catch computer hackers. The first publicly available honeypot was Fred Cohen's Deception ToolKit in 1998; since then, as malicious network attacks have become more prevalent, the use and sophistication of honeypots have grown with them.
Honeypot design and deployment is a tradeoff between realism and simplicity; this tradeoff can be characterized as the difference between high-interaction and low-interaction honeypots. A realistic design could use an actual operating system instrumented to detect and capture intruders (a high-interaction honeypot). However, detection becomes much harder, because it is difficult to distinguish the attacker's activity from normal traffic on the system. This is a low signal-to-noise detection problem: a modern operating system runs hundreds of threads that generate large volumes of traffic with complex signatures. A honeypot designed only to superficially mimic an OS (a low-interaction honeypot) can easily detect the attacker's actions, since there is no background noise. Unfortunately, the attacker can also recognize it as a decoy because of its inherent simplicity and shallowness. Low-interaction honeypots have evolved to reduce the chance of detection by implementing protocols more completely, but some researchers have deemed this approach futile, because the attacker can detect honeypots more easily than the defender can create plausible simulacra. In fact, it can be argued that low-interaction honeypots can never be fully undetectable to attackers because, by definition, they only partially emulate the service being attacked.
There are several known methods for detecting low-interaction honeypots, some of them trivial. For example, in its default configuration BearTrap, an FTP-based low-interaction honeypot, always returns the identifying banner “220 BearTrap-ftps Service ready.” Conpot, another common honeypot based on SCADA emulation, ships with the same implementation name (“Mouser Factory”) and the same serial number as defaults. Aside from these obvious examples, a common issue is that a honeypot's emulation of a service does not match the actual behavior of that service. For example, Honeyd, a platform for establishing multiple virtual honeypots of different server types, has service scripts for IIS and Linux/FTP. However, the IIS service script always returns the same response to GET, with an abnormally long time elapsed since the latest update, and the Linux/FTP service script does not support the DELE command. Another example comes from Nova, which builds on Honeyd to emulate complete machines. Its default Windows machine configuration has no NetBIOS service script, so the port appears open and accepts connections, but the service is not implemented. These behaviors clearly indicate to an attacker that the target is a honeypot rather than a real machine.
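The default-string checks described above can be sketched in a few lines of Python. This is only an illustration of how little effort such fingerprinting takes: the two marker strings are the ones quoted in the text, while the function name `match_default_fingerprint` and the idea of searching a captured response for them are assumptions made for this example, not part of any scanning tool.

```python
# Default strings quoted in the text; a real scanner would need many more.
KNOWN_DEFAULTS = {
    "BearTrap": "220 BearTrap-ftps Service ready.",
    "Conpot": "Mouser Factory",
}


def match_default_fingerprint(response_text):
    """Return the honeypots whose default string appears in a captured response.

    Hypothetical helper: it only does a case-insensitive substring search.
    """
    lowered = response_text.lower()
    return [
        name
        for name, marker in KNOWN_DEFAULTS.items()
        if marker.lower() in lowered
    ]


if __name__ == "__main__":
    print(match_default_fingerprint("220 BearTrap-ftps Service ready.\r\n"))
    print(match_default_fingerprint("220 FTP server ready."))
```

A defender can run the same check against their own deployment to confirm that no default string is still exposed.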
A high-interaction honeypot is generally harder to detect than a low-interaction honeypot, but it can itself pose a threat. It is designed to allow infection, and unless the intrusion is detected quickly it can become a vector of attack against other systems. Running the honeypot in a virtual machine can contain the attacker, but this mode of operation can be detected, which neutralizes its effectiveness while incurring substantially higher operating costs than a low-interaction honeypot. Also, as mentioned above, it is difficult to distinguish actual attacks from normal activity (the low signal-to-noise issue) on a high-interaction honeypot.
Machine learning holds the promise of realistically simulating protocols in a way that fools the attacker but does not compromise the system. This problem can be framed as a version of the Turing test, in which the attacker queries the system to see if it is a decoy. In the Turing test, an interaction between a human and a machine is observed to see whether the machine displays sufficient skill in imitating a human. Here we need to imitate a protocol. Large amounts of mined protocol commands and responses would serve as the training input to these machine learning systems, similar to what is done in building language translators and chatbots.
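As a rough sketch of what such training input might look like, the Python below turns one recorded session into (input, target) text pairs, where each input carries a little session history. The function name `build_training_pairs`, the `context_size` parameter, the `"> "` prompt marker, and the sample session are all hypothetical choices for illustration; the text does not specify a data format.

```python
def build_training_pairs(session, context_size=2):
    """Turn one session [(command, response), ...] into (input_text, target_text) pairs.

    Each input holds up to `context_size` earlier exchanges plus the current
    command, so a sequence model can learn that, for example, the output of
    `ls` depends on an earlier `cd`.
    """
    pairs = []
    for i, (command, response) in enumerate(session):
        history = session[max(0, i - context_size):i]
        lines = []
        for past_command, past_response in history:
            lines.append("> " + past_command)
            lines.append(past_response)
        lines.append("> " + command)
        pairs.append(("\n".join(lines), response))
    return pairs


if __name__ == "__main__":
    # Made-up session, for illustration only.
    demo = [("pwd", "/home/user"), ("cd /tmp", ""), ("ls", "a.txt")]
    for source, target in build_training_pairs(demo, context_size=1):
        print(repr(source), "->", repr(target))
```

Keeping earlier exchanges in the input is one simple way to expose session state to the model; a stateful recurrent network could instead be fed the session one exchange at a time.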
Learning the patterns of sequences in time is a natural fit for a type of recurrent neural network called an LSTM (long short-term memory). LSTMs have been used to generate artificial examples of mathematical texts (algebraic geometry in LaTeX). Here we are interested in producing plausible responses to protocol requests, such as ls or cd in an SSH session. LSTMs have an internal state that lets them remember the past, together with an intelligent forgetting factor that allows them to continue to learn without saturating memory. It is plausible that these types of networks can be automatically trained and deployed at a much lower cost than the humans needed to continually patch deficiencies in honeypot verisimilitude.
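The internal state and forgetting factor can be made concrete with a single step of a one-unit LSTM written in plain Python. This is the standard LSTM cell update reduced to scalars for readability; the function names and the weight layout (a dict of `(input weight, recurrent weight, bias)` tuples) are assumptions for this sketch, and a practical model would use vectors, trained weights, and a deep learning library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM with scalar input, hidden and cell state.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(gate):
        w_x, w_h, bias = w[gate]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))       # share of old memory kept
    write = sigmoid(pre_activation("input"))         # share of new candidate stored
    expose = sigmoid(pre_activation("output"))       # share of memory shown as output
    candidate = math.tanh(pre_activation("candidate"))

    c = forget * c_prev + write * candidate          # updated cell state (memory)
    h = expose * math.tanh(c)                        # updated hidden state (output)
    return h, c
```

When the forget gate is close to 1 and the input gate close to 0, the cell state passes through a step almost unchanged, which is how the network carries information such as the current directory across many commands; a forget gate close to 0 discards it.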
A few practical points follow from this discussion. First, anyone deploying a honeypot should avoid its default configuration whenever possible. Second, the service script behavior should be designed to match what an attacker expects from the real service. For example, in the case of the IIS GET response of the Honeyd script, one could return an empty directory listing and randomize the time stamps, byte counts, and volume serial number. More generally, one might consider an intelligent algorithm or approach that changes or mutates a honeypot from a detectable form back to an undetectable one.
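The randomization suggested for the IIS response could be sketched as follows. The function name, the field names, the value ranges, and the `XXXX-XXXX` serial format are illustrative assumptions, not taken from Honeyd; the point is only that values a default script hard-codes should differ per deployment.

```python
import random
from datetime import datetime, timedelta


def randomized_listing_fields(rng, now, max_age_days=30):
    """Return per-deployment values for fields a default script would hard-code.

    All ranges here are arbitrary illustrative choices.
    """
    age = timedelta(seconds=rng.randrange(max_age_days * 24 * 3600))
    serial = "{:04X}-{:04X}".format(rng.randrange(0x10000), rng.randrange(0x10000))
    return {
        "last_modified": now - age,               # recent, not abnormally old
        "byte_count": rng.randrange(512, 65536),
        "volume_serial": serial,
    }


if __name__ == "__main__":
    # SystemRandom draws from the OS entropy source, so values differ per run.
    print(randomized_listing_fields(random.SystemRandom(), datetime.now()))
```

Passing the random generator in as an argument keeps the function reproducible in tests (with a seeded `random.Random`) while still allowing unpredictable values in deployment.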
Arshak Navruzyan, Steve Shimozaki