Current Research Projects and Activities

Current Research Statement

Our research interests are in the areas of network security and online privacy. Our work broadly combines the design, analysis, and development of security and privacy primitives and tools for various systems. Over the past ten years, our interests have evolved to include topics in (big and small data) security analytics, social network security and privacy, Internet security, network security, and privacy. Our research in these areas draws on exploratory, constructive, and empirical methods. A common theme in our most recent work is the use of advanced machine learning techniques for security analytics: to understand code, traffic, and infrastructure usage in real-world deployments. Our earlier work focused on understanding, through design and analysis, various security issues in multiple networking contexts.

Adversarial and Applied Machine Learning

Conventional machine learning approaches: Until recently, the majority of our work focused on conventional machine learning approaches, including supervised and unsupervised learning, to classify and automatically label threat indicators (such as domain names, binaries, and vulnerability severity labels) and to make predictions (on time-series data). Supervised learning algorithms used include SVM, MLP, RF, and ANFIS, among others. Unsupervised learning algorithms include k-means, fuzzy c-means, and hierarchical clustering. Recent problems we solved using conventional machine learning algorithms include building Internet of Things malicious software detectors, a vulnerability severity score predictor and labeling system, a semi-supervised detector of cryptojacking code (malicious code that abuses computer systems for cryptomining), a malicious webpage classification system (to annotate malicious webpages based on capabilities and compromise vector), and a vulnerability cost assessment system (by stock performance prediction using ARIMA), among others.

Deep learning approaches: As the complexity and size of the data increased, we utilized different deep learning algorithms for both feature extraction and pattern recognition. Deep learning algorithms benefit from automatic feature extraction and learning, which not only improves the performance of the model by extracting more meaningful features but also eliminates the feature extraction phase of conventional machine learning algorithms, which is laborious and requires domain knowledge. We utilized convolutional neural networks for Internet of Things malicious software detection, intrusion detection in software-defined networking, website fingerprinting (for improving privacy), authorship identification (for identifying malicious code authors and document forgers), and classification of binaries.

Adversarial machine learning approaches: Adversarial learning is concerned with generating input samples that are similar to the original ones (with simple perturbations) but fool machine learning algorithms (e.g., cause misclassification). Such samples can be used to improve the robustness of machine learning algorithms, to highlight their risks through purposeful attacks, and to understand their practical limitations. Algorithms used for adversarial learning include MIM, FGSM, JSMA, PGD, DeepFool, and NewtonFool, among others. Problems that benefited from adversarial machine learning approaches include generating practical malware samples that not only fool classifiers but also remain executable, intrusion detection in software-defined networks, and website fingerprinting.
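To make the idea of a simple perturbation concrete, the following is a minimal sketch of a single FGSM step against a toy logistic-regression classifier. The model weights, the sample, and the step size `epsilon` are hypothetical values chosen for illustration; they are not taken from any of the systems or papers listed here, which target far larger models.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm(weights, bias, x, y, epsilon):
    """One FGSM step: shift x by epsilon along the sign of the loss gradient.

    For binary cross-entropy loss with a logistic model, the gradient of the
    loss with respect to input feature i is (p - y) * w_i.
    """
    p = predict_proba(weights, bias, x)
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy model and sample (assumptions for illustration only).
weights = [2.0, -1.0]
bias = 0.0
x = [0.5, 0.2]  # sample whose true label is 1
y = 1

x_adv = fgsm(weights, bias, x, y, epsilon=0.5)

clean_label = int(predict_proba(weights, bias, x) >= 0.5)
adv_label = int(predict_proba(weights, bias, x_adv) >= 0.5)
print(clean_label, adv_label)
```

In this toy setting the perturbed sample crosses the decision boundary while each feature moves by at most `epsilon`. Iterative methods such as PGD and MIM repeat steps of this kind under a perturbation budget.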

Representative Publications

  • PDF Multi-χ: Identifying Multiple Authors from Source Code Files. Mohammed Abuhamad, Tamer Abuhmed, DaeHun Nyang, and David Mohaisen. Privacy Enhancing Technologies Symposium (PoPETS/PETS), 2020.
  • PDF Soteria: Detecting Adversarial Examples in Control Flow Graph-based Malware Classifiers. Hisham Alasmary, Ahmed Abusnaina, Rhongho Jang, Mohammed Abuhamad, Afsah Anwar, DaeHun Nyang, and David Mohaisen. The 40th IEEE International Conference on Distributed Computing Systems (IEEE ICDCS 2020)
  • PDF Large-Scale and Language-Oblivious Code Authorship Identification. M. Abuhamad, T. Abuhmed, A. Mohaisen, D. Nyang: ACM SIGSAC Conference on Computer and Communications Security (ACM CCS 2018)

Secure and Reliable Systems with Blockchains

Our work on blockchains covers a range of topics, from primitives and foundations to applications and translations. More precisely, we have been pursuing three thrusts of research: 1) foundational and principled research into distributed systems primitives (consensus algorithms) that would ensure desirable properties in blockchain systems, such as privacy, fairness, and decentralization, 2) distributed systems requirements and their translation into a blockchain framework by combining requirements engineering and composable designs, and 3) sustainability of system properties in the new ecosystem through active measurements (predictive models) and design evolution of alternatives and trade-offs. Related to the last thrust, we have been working on understanding the abuse of blockchains through a system attack surface analysis.

Representative Publications

  • PDF Towards Characterizing Blockchain-based Cryptocurrencies for Highly-Accurate Predictions. M. Saad, J. Choi, J. Kim, D. Nyang, A Mohaisen. IEEE Systems Journal (IEEE ISJ 2020) Best Paper Award
  • PDF Exploring the Attack Surface of Blockchain: A Systematic Overview. Muhammad Saad, Jeffrey Spaulding, L. Njilla, C. A. Kamhoua, S. Shetty, D. Nyang, Aziz Mohaisen: IEEE Communication Surveys and Tutorials (IEEE CS&T 2020).
  • PDF Exploring Spatial, Temporal, and Logical Attacks on the Bitcoin Network. M. Saad, V. Cook, L. Nguyen, My Thai, A. Mohaisen: 39th IEEE International Conference on Distributed Computing Systems (IEEE ICDCS 2019)

Mobile and Internet of Things Security and Privacy

Mobile security threats have recently emerged because of the fast growth in mobile technologies and the essential role that mobile devices play in our daily lives. To address these threats, particularly those associated with malware, various techniques have been developed in the literature, including ones that use static, dynamic, on-device, off-device, and hybrid approaches for identifying, classifying, and defending against mobile threats. Those techniques fail at times and succeed at others, while creating a trade-off between performance and operation. To this end, we contribute several systems: Andro-AutoPsy, AndroTracker, Andro-Dumpsys, and APHunter. In summary, we design efficient and accurate techniques for detecting and classifying mobile malware, techniques for improving privacy in mobile networks, and techniques for detecting malicious hardware access points.

In our work on IoT security, we systematize the security threats in smart home networks to gain a finer understanding of them, and we propose a comprehensive multi-layer and cross-layer analysis, recommendations, and design of primitives and functions for securing the home network. Toward that end, the research explores a quantification of the attack surface of home network devices, network gear, and services, to guide a system-aware design of a security layer that incorporates primitives at the device, network, and service layers of the home network for intrusion detection and prevention. For prevention, the security layer features various functions such as secure naming and resolution, safe cryptographic primitives, including identification and authentication, and various other system layer-specific primitives. For intrusion detection, the cornerstone of our security layer is a behavioral logging and profiling capability at the device, network, and service layers, to facilitate real-time intrusion detection and notification. In summary, in this research theme we develop algorithms to improve the efficiency, security, and operation of mobile and wireless networks, including usable authentication techniques and intrusion detection mechanisms, among others, for use in IoT applications. En route, we also explore various aspects of IoT privacy.

Representative Publications

  • PDF AUToSen: Deep Learning-based Implicit Continuous Authentication Using Smartphone Sensors. Mohammed Abuhamad, Tamer Abuhmed, David Mohaisen, and DaeHun Nyang. IEEE Internet of Things Journal (IEEE IoTJ 2020)
  • PDF Catch Me If You Can: Rogue Access Point Detection Using Intentional Channel Interference. R. Jang, J. Kang, A. Mohaisen, D. Nyang. IEEE Transactions on Mobile Computing (IEEE TMC 2019)
  • PDF XLF: A Cross-layer Framework to Secure the Internet of Things (IoT). An Wang, Aziz Mohaisen and Songqing Chen. 39th IEEE International Conference on Distributed Computing Systems (IEEE ICDCS 2019)

Distributed Denial of Service Attacks and Defenses

Analyzing and understanding distributed denial of service (DDoS) attacks is another thrust of our work. Enormous efforts are continuously made by both academia and industry to understand DDoS attacks and defend against them. With an ever-improving defense posture, the attack strategies are constantly changing as well, making DDoS attacks some of the most severe threats on the Internet. DDoS attacks are, by nature, difficult to defend against because it is hard to know in advance 1) when an attack will be launched, 2) where the attacking machines are located, 3) how many attacking machines are involved, and 4) how long an attack will last (among others).

Most Internet DDoS attacks today are attributed to large, interconnected, and overly complex entities that belong to various botnets. For such botnet-based (commercialized) DDoS attacks, understanding the underlying relationships between various attacks and attackers is fundamental to defending against them. In particular, are those relationships and efforts totally random? How do the attackers manage their resources? Can we estimate attack origins, sizes, duration, start time, and magnitude based on historical data? If there are patterns in these attacks, can we learn and utilize them to improve the existing defenses? Clearly, understanding the latest attack strategies and postures is key to the success of any defense.

To pursue this work, and as a starting point, we relied on 50,704 different Internet DDoS attacks across the globe, with data collected operationally over a seven-month period. These attacks were launched by 674 botnet generations from 23 different botnet families, with a total of 9,026 victim IPs belonging to 1,074 organizations that are collectively located in 186 countries. To sum up, we design and develop a data-driven and model-guided approach to defending against application-level distributed denial of service (DDoS) attacks by botnets.

Representative Publications

  • PDF Examining the Robustness of Learning-Based DDoS Detection in Software Defined Networks. Ahmed Abusnaina, Aminollah Khormali, Daehun Nyang, Murat Yuksel and Aziz Mohaisen. The 2019 IEEE Conference on Dependable and Secure Computing (IEEE DSC 2019) Best Paper--Runner Up
  • PDF A Data-Driven Study of DDoS Attacks and Their Dynamics. A. Wang, W. Chang, S. Chen, and A. Mohaisen. IEEE Transactions on Dependable and Secure Computing (IEEE TDSC 2018)
  • PDF An Adversary-Centric Behavior Modeling of DDoS Attacks. A. Wang, A. Mohaisen and S. Chen: IEEE International Conference on Distributed Computing Systems (IEEE ICDCS 2017)

Systems for Scalable Measurements and Monitoring

In the zettabyte era, per-flow measurement becomes more challenging for the data center owing to the increase of both traffic volumes and the number of flows. Also, the swiftness of the detection of anomalies (e.g., congestion, link failure, and DDoS attack) becomes paramount.

Scaling Up Per-flow Measurement Using DRAM: For fast and accurate traffic measurement, managing an accurate working set of active flows (WSAF) from massive volumes of packet influxes at line rates is a key challenge. WSAF is usually located in high-speed but expensive memory, such as TCAM or SRAM, and thus the number of entries to be stored is quite limited. To cope with the scalability issue of WSAF, in the first phase of this project, we propose to use In-DRAM WSAF with scales and put a compact data structure (FlowRegulator) in front of WSAF to compensate for DRAM's slow access time by substantially reducing massive influxes to WSAF without compromising measurement accuracy. To verify its practicability, we further build a per-flow measurement system, called InstaMeasure, on an off-the-shelf Atom (lightweight) processor board. We evaluate our proposed system in a large-scale real-world experiment (connected to the monitoring port of our campus main gateway router for 113 hours, and capturing 122.3 million flows).

Scaling Up Per-flow Measurement Using Sampling: In the second part of this project, we aimed to design a novel sampling scheme to deal with the poor trade-off provided by random sampling. Starting with a simple idea that "independent per-flow packet sampling provides the most accurate estimation of each flow," we introduced a new concept of per-flow systematic sampling, which provides the same sampling rate across all flows. In addition, we realized this concept in a concrete sampling method called SketchFlow, which approximates per-flow systematic sampling using a sketch saturation event.
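The concept of per-flow systematic sampling can be illustrated with an idealized sketch that keeps an exact packet counter for every flow and samples every `interval`-th packet of each flow. This is not SketchFlow: SketchFlow approximates this behavior with a sketch saturation event precisely to avoid exact per-flow state. The function names and the toy packet trace below are assumptions made for illustration.

```python
from collections import defaultdict


def per_flow_systematic_sample(packets, interval):
    """Return every `interval`-th packet of each flow.

    packets: iterable of (flow_id, payload) tuples in arrival order.
    An exact per-flow counter is used here only to show the ideal behavior.
    """
    seen = defaultdict(int)
    sampled = []
    for flow_id, payload in packets:
        seen[flow_id] += 1
        if seen[flow_id] % interval == 0:
            sampled.append((flow_id, payload))
    return sampled


def estimate_flow_sizes(sampled, interval):
    """Scale the per-flow sample counts back up by the sampling interval."""
    counts = defaultdict(int)
    for flow_id, _ in sampled:
        counts[flow_id] += 1
    return {flow_id: count * interval for flow_id, count in counts.items()}


# Hypothetical trace: one large flow "A" and one small flow "B".
packets = [("A", i) for i in range(100)] + [("B", i) for i in range(10)]

sampled = per_flow_systematic_sample(packets, interval=5)
estimates = estimate_flow_sizes(sampled, interval=5)
print(estimates)
```

Because every flow is sampled at the same rate, the small flow is still represented in the sample. Under uniform random packet sampling at a comparable rate, a small flow may contribute no sampled packets at all.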

System Integration: For the last part of this research project, and system-wide, we proposed an SDN-based WLAN monitoring and management framework called RFlow+ to address WiFi service dissatisfaction caused by the limited view (lack of scalability) of network traffic monitoring and the absence of intelligent and timely network treatments. Existing solutions (e.g., OpenFlow and sFlow) have a limited view, no generic flow description, and a poor trade-off between measurement accuracy and network overhead that depends on the selected sampling rate. To resolve these issues, we devise a two-level counting mechanism, namely a distributed local counter (on-site and real-time) and a central collector (a summation of local counters). With this, we proposed a highly scalable monitoring and management framework to handle immediate actions based on short-term (e.g., 50 ms) monitoring and eventual actions based on long-term (e.g., one month) monitoring. The former uses the local view of each access point (AP), and the latter uses the global view of the collector.
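The two-level counting mechanism can be sketched as follows. This is only an illustration of the local-counter and central-collector split described above, not the RFlow+ implementation; the class names, method names, thresholds, and traffic values are hypothetical.

```python
from collections import Counter


class LocalCounter:
    """Per-AP counter: holds the local, real-time view of traffic."""

    def __init__(self, ap_name):
        self.ap_name = ap_name
        self.bytes_per_flow = Counter()

    def observe(self, flow_id, size):
        self.bytes_per_flow[flow_id] += size

    def heavy_flows(self, threshold):
        """Flows needing immediate action, judged from the local view only."""
        return [f for f, b in self.bytes_per_flow.items() if b >= threshold]


class CentralCollector:
    """Global view: the summation of all local counters."""

    def __init__(self):
        self.global_bytes = Counter()

    def collect(self, local_counters):
        self.global_bytes = Counter()
        for local in local_counters:
            # Counter.update adds counts rather than replacing them.
            self.global_bytes.update(local.bytes_per_flow)
        return self.global_bytes


# Hypothetical traffic seen by two access points.
ap1 = LocalCounter("ap1")
ap2 = LocalCounter("ap2")
ap1.observe("f1", 1000)
ap1.observe("f2", 200)
ap2.observe("f1", 500)

collector = CentralCollector()
totals = collector.collect([ap1, ap2])
print(ap1.heavy_flows(800), dict(totals))
```

Keeping the decision for short-term actions at the AP avoids a round trip to the collector, while long-term decisions can wait for the aggregated totals.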

Representative Publications

  • PDF SketchFlow: Per-Flow Systematic Sampling Using Sketch Saturation Event. Rhongho Jang, Daehong Min, Seongkwang Moon, Aziz Mohaisen, and DaeHun Nyang. In Proceedings of the 39th IEEE International Conf. on Computer Communications (IEEE INFOCOM 2020).
  • PDF InstaMeasure: Instant Per-flow Detection Using Large In-DRAM Working Set of Active Flows. R. Jang, S. Moon, Y. Noh, A. Mohaisen and D. Nyang. 39th IEEE International Conference on Distributed Computing Systems (IEEE ICDCS 2019)
  • PDF Large-scale Invisible Attack on AFC Systems with NFC-equipped Smartphones. F. Dang, P. Zhou, Z. Li, E. Zhai, A. Mohaisen, Q. Wen, and M. Li: IEEE International Conf. on Computer Communications (IEEE INFOCOM 2017)

VR/AR/Wearable Security and Privacy

Privacy leakage from elevation profiles: The extensive use of smartphones and wearable devices has facilitated many useful applications. With Global Positioning System (GPS)-equipped smart and wearable devices, many applications can gather, process, and share rich metadata, such as geolocation, trajectories, elevation, and time. For example, fitness applications, such as Runkeeper and Strava, utilize this information for activity tracking and have recently witnessed a boom in popularity. Those fitness tracker applications have their own web platforms and allow users to share activities on such platforms, or even on other social network platforms. To preserve the privacy of users while allowing sharing, several of those platforms allow users to disclose partial information, such as the elevation profile of an activity, which supposedly would not leak the location of the users. In this work, and as a cautionary tale, we create a proof of concept in which we examine the extent to which elevation profiles can be used to predict the location of users. To tackle this problem, we devise three plausible threat settings under which the city or borough of the targets can be predicted. Those threat settings define the amount of information available to the adversary to launch the prediction attacks. After establishing that simple features of elevation profiles, e.g., spectral features, are insufficient, we devise both a natural language processing (NLP)-inspired text-like representation and a computer vision-inspired image-like representation of elevation profiles, and we convert the problem at hand into text and image classification problems. We use both traditional machine learning- and deep learning-based techniques, and achieve a prediction success rate ranging from 59.59% to 95.83%. The findings are alarming and highlight that sharing elevation information may carry significant location privacy risks.
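As a rough illustration of what a text-like representation of an elevation profile could look like, the sketch below quantizes the change between consecutive elevation samples into a small alphabet and extracts n-grams, which a standard text classifier could then consume. The encoding, the alphabet (`U`, `D`, `F`), the `step` threshold, and the sample profile are all assumptions made for illustration; the text above does not specify the actual encoding used in this work.

```python
def elevation_to_symbols(elevations, step=1.0):
    """Encode consecutive elevation changes as U (up), D (down), or F (flat).

    A change larger than `step` meters counts as up or down; anything
    smaller is treated as flat. The threshold is an assumed parameter.
    """
    symbols = []
    for previous, current in zip(elevations, elevations[1:]):
        delta = current - previous
        if delta > step:
            symbols.append("U")
        elif delta < -step:
            symbols.append("D")
        else:
            symbols.append("F")
    return "".join(symbols)


def ngrams(text, n):
    """Return all overlapping character n-grams, usable as text 'words'."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical elevation profile in meters.
profile = [10.0, 12.5, 15.0, 15.2, 13.0, 10.0]

encoded = elevation_to_symbols(profile)
tokens = ngrams(encoded, 2)
print(encoded, tokens)
```

Once profiles are expressed as token sequences, predicting a city or borough becomes an ordinary text classification task.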

AR/VR Security: Enabling users to push the limits of the physical world, augmented reality (AR) and virtual reality (VR) platforms opened a new chapter in human perception. The novel immersive experiences resulted in the emergence of new interaction methods for virtual environments, which came with security and privacy risks that had not been considered before. In this project, we explore a spatial side-channel keylogging attack to infer user inputs typed with in-air tapping keyboards in virtual environments. Our attack exploits the observation that hands follow certain patterns while typing in the air. We introduce three plausible attack scenarios under which the adversary obtains the hand traces of the victim by either planting a small hand tracker near the victim, staying in close proximity to the victim, or tricking the victim into installing a malicious application. Our five-step pipeline takes the hand traces of the victim and outputs a set of inferences ordered from best to worst. Through our experiments, we achieved pinpoint accuracy ranging from 40% to 87% within at most the top-500 candidate reconstructions. We discuss possible countermeasures, while the results presented provide a cautionary tale of the potential security and privacy risks of immersive mobile technology.

Representative Publications

  • PDF Understanding the Potential Risks of Sharing Elevation Information on Fitness Applications. Ulku Meteriz, Necip Fazil Yıldıran, Joongheon Kim, and David Mohaisen. The 40th IEEE International Conference on Distributed Computing Systems (IEEE ICDCS 2020)
  • PDF Deep Fingerprinting Defender: Adversarial Learning-based Approach to Defend Against Website Fingerprinting. Ahmed Abusnaina, Rhongho Jang, Aminollah Khormali, DaeHun Nyang, and Aziz Mohaisen. In Proceedings of the 39th IEEE International Conf. on Computer Communications (IEEE INFOCOM 2020).
  • PDF You are a Game Bot!: Uncovering game bots in MMORPGs via self-similarity in the wild. E. Lee, J. Woo, H. Kim, A. Mohaisen, H. Kim: ISOC Network and Distributed System Security Symposium (ISOC NDSS 2016)

Detecting Toxic Content Online Using Natural Language Processing

Social media has become an essential part of the daily routines of children and adolescents. Enormous efforts have been made to ensure the psychological and emotional well-being of young users, including in their interaction with social media. This study measures the exposure of children and adolescents to age-inappropriate content in comments posted on YouTube videos of the top-200 children's shows, targeting eight categories. This task is challenging for several reasons. First, studying comments on children's videos requires manually collecting channels and shows targeting this demographic, since YouTube categories are established by topic rather than by age group. Second, assigning age groups to the collected videos, which is needed to measure exposure by separate groups, can be daunting. Third, the ground-truth data for safe and inappropriate content are not age-specific. Finally, given the variety of age-inappropriate content for children, building a unified system for detecting such content is challenging. We collected a large-scale dataset (approximately four million records) and studied the presence of eight age-inappropriate categories and the amount of exposure caused by each category. Using natural language processing and machine learning techniques, we constructed ensemble classifiers that achieved high detection accuracy for inappropriate content under various evaluation metrics. Initial results show a large percentage of worrisome comments with inappropriate content.