Identify people through their Google Search based on typing behaviour

Nicklas Körtge
Published in Analytics Vidhya
6 min read · Jan 23, 2021


How to extract Keystroke Biometrics from encrypted Google search traffic.

Introduction

The concept of personal data is the basic element of the General Data Protection Regulation (GDPR). It covers all information relating to an identified or identifiable natural person, and the GDPR aims to give such data special protection. However, so-called metadata, which is not explicitly protected by the GDPR, can also allow conclusions to be drawn about the identity of individuals. One feature that produces metadata with this potential is the Search Suggestion Function (SSF) of search engines.

A Search Suggestion Function (SSF) is the automatic suggestion and completion of possible search queries based on the characters typed so far.

Because each keystroke triggers a request, recording the network data stream while a search query is entered and identifying the relevant network packets makes it possible to measure the user's typing biometrics from those packets.

Source: Keystroke biometrics in the encrypted domain: a first study on search suggestion functions of web search engines by Nicholas Whiskerd, Nicklas Körtge, Kris Jürgens, Kevin Lamshöft, Salatiel Ezennaya-Gomez, Claus Vielhauer, Jana Dittmann and Mario Hildebrandt, CC BY 4.0

The figure visualises the relationship between typing the word “weather” and the keystroke biometric. The inter-keystroke times (the time elapsed between two consecutive keystrokes) can be calculated from the timestamps of the packets of a typing sequence.
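As a minimal sketch of that calculation, assume the capture timestamps of the packets belonging to one typing sequence are already available as floats in seconds. The function name `inter_keystroke_times` and the example values are illustrative and not taken from the original program.

```python
from typing import List


def inter_keystroke_times(timestamps: List[float]) -> List[float]:
    """Return the elapsed time between each pair of consecutive packets."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical timestamps (seconds) for the seven packets of "weather".
example = [0.0, 0.25, 0.5, 1.0, 1.125, 1.5, 2.0]
print(inter_keystroke_times(example))
```

A sequence of n packets yields n - 1 inter-keystroke times, so a single packet on its own carries no timing information.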

This article describes how to extract these relevant packets from a snapshot of network traffic.

For more theoretical background, please have a look at the paper cited above.

Approach

The technical approach splits the extraction of biometric information across several components. First, the network traffic is sniffed with Wireshark. To make the captured traffic available to our own application, the data is stored in a PCAP file. The developed software reads this file and applies a filter to it. The filter contains a set of values with which the pattern algorithm extracts the relevant packets of each keystroke sequence. From these packets, the inter-keystroke times can be calculated as a measurable representation of the keystroke biometric.

  • Network traffic: Wireshark is used to capture the network traffic. The result will be exported as a PCAP file (Wireshark/tcpdump/…-pcap).
  • Filter: To configure the different parameters for the algorithm, an extra filter is created for each of the different environments.
  • Keystroke Biometric: The keystroke biometric will be exported as inter-keystroke times in ARFF-format.
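To give an idea of the last step, here is a small sketch that renders inter-keystroke times in ARFF syntax. The attribute layout (`sequence_id`, `position`, `inter_keystroke_time`) and the function name `to_arff` are assumptions made for this example; the attributes the original program actually writes may differ.

```python
from typing import Iterable, List


def to_arff(relation: str, sequences: Iterable[List[float]]) -> str:
    """Render one inter-keystroke time per data row in ARFF syntax."""
    lines = [
        f"@RELATION {relation}",
        "",
        # Assumed attribute layout, not taken from the original program.
        "@ATTRIBUTE sequence_id NUMERIC",
        "@ATTRIBUTE position NUMERIC",
        "@ATTRIBUTE inter_keystroke_time NUMERIC",
        "",
        "@DATA",
    ]
    for sequence_id, times in enumerate(sequences):
        for position, value in enumerate(times):
            lines.append(f"{sequence_id},{position},{value:.6f}")
    return "\n".join(lines) + "\n"


print(to_arff("keystrokes", [[0.25, 0.5], [0.125]]))
```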

Realisation

Preface: This article covers and explains only the main logic of the application. The full program is available on GitHub.

For capturing keystrokes of a person using Firefox on a PC, the configuration file looks like this:

The file contains the following sections, some of them with subsections:

  • IP version: The IP version is initially specified here. Due to a possible automatic detection of the used IP version, this option might become obsolete in future versions. For now, the IP version used must be provided via the configuration file.
  • Port: This value specifies the destination port by which the network traffic will be filtered.
  • Input Phrase: The input phrase can be configured to compare the recognized keystrokes with the expected letters of the input phrase. This option was part of the quality check during development and is only available when network traffic is decrypted and blind mode is turned off.
  • System: This string value defines the system on which the search query was generated. The combination of device type and operating system influences which pattern algorithm can be used to detect packets in network traffic. Web browsers only affect the window sizes (next bullet point), but have no influence on the logical process of searching for the correct network packets.
  • Windows: This part of the configuration is the most important. It defines the size ranges in which the pattern algorithm detects the different types of packets of a keystroke stream, such as the first (start) packet of a sequence. For a more detailed explanation of the different values, see the explanation of the pattern algorithm.
  • Faulty Stream Counter: This value specifies the maximum number of detected packets that are not part of the keystroke stream before the pattern algorithm reports that the current typing sequence is finished and possibly a new one starts.
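The sections above can be sketched as a small Python structure. This is only an illustration: the class and field names are hypothetical, and the original configuration file on GitHub may be laid out differently. The window sizes, the size delta and the faulty stream counter are the values this article gives for Firefox on a PC in the pattern-algorithm section, and the input phrase is the word used in the figure; the IP version and the system string are placeholder assumptions.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class Windows:
    """Packet size ranges in bytes (values for Firefox on a PC)."""
    first: Tuple[int, int] = (220, 350)      # first packet of a sequence
    second: Tuple[int, int] = (200, 270)     # second packet of a sequence
    following: Tuple[int, int] = (200, 280)  # every later packet
    max_delta: int = 25                      # allowed change to previous size


@dataclass(frozen=True)
class FilterConfig:
    ip_version: int = 4            # placeholder assumption
    port: int = 443                # destination port used for filtering
    input_phrase: str = "weather"  # only used for quality checks
    system: str = "pc"             # placeholder assumption
    windows: Windows = field(default_factory=Windows)
    faulty_stream_counter: int = 10


config = FilterConfig()
print(config.windows.first, config.faulty_stream_counter)
```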

Pattern-Algorithm

The pattern algorithm forms the main logic of this application. The following code was created and improved by continuous testing until the results were good enough.

The detected pattern for packets from a typing sequence when using Firefox on a PC.

Before the pattern algorithm for recognizing packets from a keystroke stream can be developed, the actual pattern must first be defined. For typing a search query into the Google search engine using Firefox on a PC, the pattern looks as visualized in the figure above. The size of the first packet, corresponding to the first keystroke entered, is in the range of 220 to 350 bytes. For it to be recognized as the first packet of a keystroke stream, the next packet (from the sequence of packets pre-filtered by destination port, destination IP and TCP payload) must have a size between 200 and 270 bytes. Any other incoming packet that is part of this keystroke sequence has a size in the range of 200 to 280 bytes and differs from the size of the previous packet by no more than 25 bytes. A keystroke sequence is terminated if, after the last valid packet, more than 10 packets (the faulty stream counter) are detected that do not meet the requirements for a packet of the keystroke stream (size between 200 and 280 bytes, within 25 bytes of the previous one).
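The size rules can be written down as three small predicates. This is a sketch, not the original implementation: the function names match those used in this article, but their signatures, the constant names and the choice of inclusive range bounds are assumptions.

```python
from typing import Tuple

# Size windows in bytes for Firefox on a PC, as described above.
FIRST_WINDOW = (220, 350)
SECOND_WINDOW = (200, 270)
NEXT_WINDOW = (200, 280)
MAX_DELTA = 25


def in_window(size: int, window: Tuple[int, int]) -> bool:
    """Check whether a packet size lies inside an inclusive range."""
    low, high = window
    return low <= size <= high


def is_new_stream(size: int) -> bool:
    """Could this packet be the first one of a keystroke stream?"""
    return in_window(size, FIRST_WINDOW)


def is_second_package_of_stream(size: int) -> bool:
    """Could this packet be the second one of a keystroke stream?"""
    return in_window(size, SECOND_WINDOW)


def is_next_package(size: int, previous_size: int) -> bool:
    """Later packets must fit the window and stay close to the previous size."""
    return in_window(size, NEXT_WINDOW) and abs(size - previous_size) <= MAX_DELTA


print(is_new_stream(300), is_second_package_of_stream(250), is_next_package(260, 250))
```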

Within the implementation of the pattern algorithm, the network stream is first filtered by:

  • Destination port: check if the destination-port is 443
  • Destination IP: check if the destination IP address of the packet is an address from Google
  • Payload: check if the packet contains a payload
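These three checks could look roughly like the sketch below. The `Packet` class and the function `passes_prefilter` are hypothetical stand-ins for whatever the capture library delivers, and the network list is a placeholder from a documentation-only address range, not a real Google range; the actual address ranges have to be supplied separately.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of one captured packet."""
    timestamp: float
    dst_ip: str
    dst_port: int
    payload_size: int


# Placeholder (TEST-NET-3 documentation range), NOT a real Google network.
SEARCH_ENGINE_NETWORKS = [ip_network("203.0.113.0/24")]


def passes_prefilter(packet: Packet, port: int = 443) -> bool:
    """Apply the destination port, destination IP and payload checks."""
    if packet.dst_port != port:
        return False
    address = ip_address(packet.dst_ip)
    if not any(address in network for network in SEARCH_ENGINE_NETWORKS):
        return False
    return packet.payload_size > 0


print(passes_prefilter(Packet(0.0, "203.0.113.7", 443, 230)))
```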

If a packet of the network stream meets all three requirements, the lookup_for_new_stream() function is called, which checks whether this packet is a possible start of a keystroke stream by calling is_new_stream.

If at least one beginning of a stream has been detected, the lookup_for_new_packages_of_current_stream() function is called for each new packet passing the filter. Inside this function, is_second_package_of_stream is called if the currently selected stream contains only one packet, to check whether the current packet fulfils the properties of a second packet.

In the same way, the is_next_package function inside lookup_for_new_packages_of_current_stream() is called every time is_second_package_of_stream returns false. For the same selected packet, this function checks whether it qualifies as the next packet of the keystroke stream.
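The control flow can be sketched as follows. This is a simplified reading of the description, not the original code: it tracks only one candidate stream at a time, works on packet sizes only, calls is_next_package only once a second packet has been accepted, and keeps only streams with at least two packets. The class `StreamDetector` and its `feed` and `close` methods are hypothetical names.

```python
from typing import List, Optional


def is_new_stream(size: int) -> bool:
    return 220 <= size <= 350


def is_second_package_of_stream(size: int) -> bool:
    return 200 <= size <= 270


def is_next_package(size: int, previous: int) -> bool:
    return 200 <= size <= 280 and abs(size - previous) <= 25


class StreamDetector:
    """Groups pre-filtered packet sizes into keystroke streams."""

    def __init__(self, faulty_stream_counter: int = 10) -> None:
        self.limit = faulty_stream_counter
        self.current: Optional[List[int]] = None
        self.faulty = 0
        self.finished: List[List[int]] = []

    def feed(self, size: int) -> None:
        if self.current is None:
            self.lookup_for_new_stream(size)
        else:
            self.lookup_for_new_packages_of_current_stream(size)

    def lookup_for_new_stream(self, size: int) -> None:
        if is_new_stream(size):
            self.current, self.faulty = [size], 0

    def lookup_for_new_packages_of_current_stream(self, size: int) -> None:
        stream = self.current
        if len(stream) == 1:
            accepted = is_second_package_of_stream(size)
        else:
            accepted = is_next_package(size, stream[-1])
        if accepted:
            stream.append(size)
            self.faulty = 0
            return
        # Packet does not belong to the stream: count it as faulty.
        self.faulty += 1
        if self.faulty > self.limit:
            self.close()
            self.lookup_for_new_stream(size)

    def close(self) -> None:
        """Finish the current stream; keep it if it has two or more packets."""
        if self.current is not None and len(self.current) >= 2:
            self.finished.append(self.current)
        self.current, self.faulty = None, 0


detector = StreamDetector(faulty_stream_counter=2)
for packet_size in [300, 250, 255, 260, 900, 900, 900, 310, 240, 245]:
    detector.feed(packet_size)
detector.close()
print(detector.finished)
```

In a fuller version the detector would store timestamps alongside the sizes, so that the inter-keystroke times can be computed per finished stream.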

Around this logic for detecting the pattern of a keystroke sequence within a network stream, further functions for managing the information have to be implemented.

The full code is available via GitHub.

Results

By collecting a sufficient number of inter-keystroke times from a given individual, biometric typing behavior can be mapped as a frequency distribution across all data points.

This figure visualizes the biometric typing behavior of two different individuals collected by capturing encrypted Google search traffic (x-axis: relative frequency, y-axis: inter-keystroke-time in seconds).

By comparing different distributions of typing behaviour it may be possible to detect and identify a person just by observing encrypted network traffic!
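One simple way to make such a comparison concrete is to bin the inter-keystroke times into relative frequencies and measure how far two histograms are apart. The sketch below uses the total variation distance for this; that metric, the bin width and the function names are choices made for this example and are not the method used in the article or the paper. The sample values are made up.

```python
from typing import List


def relative_frequencies(times: List[float], bin_width: float = 0.1,
                         max_time: float = 1.0) -> List[float]:
    """Histogram of inter-keystroke times, normalised to sum to 1."""
    bins = int(round(max_time / bin_width))
    counts = [0] * bins
    for value in times:
        # Values beyond max_time are collected in the last bin.
        index = min(int(value / bin_width), bins - 1)
        counts[index] += 1
    total = len(times)
    if total == 0:
        return [0.0] * bins
    return [count / total for count in counts]


def total_variation_distance(p: List[float], q: List[float]) -> float:
    """0.0 for identical distributions, 1.0 for completely disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up inter-keystroke times (seconds) for two people.
person_a = [0.12, 0.18, 0.12, 0.33]
person_b = [0.42, 0.47, 0.33, 0.45]
distance = total_variation_distance(relative_frequencies(person_a),
                                    relative_frequencies(person_b))
print(distance)
```

With only a handful of samples such a distance is not meaningful; it only becomes informative once enough inter-keystroke times per person have been collected.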

Quality

To describe the quality and correctness of the pattern algorithm, two classes of errors can be used.

False Negatives:

False Positives:

  • Captured network traffic without using Google search
  • Around 11,000 packets detected with Wireshark
  • Even without checking IP and port, the pattern algorithm detected no keystroke sequence

Nicklas Körtge

Researcher and Software Engineer @IBM Research ZRL, Quantum Safe, Security