Analyzing web application data
Learn how to effectively analyze and clean web application data, ensure consistency across logs, enrich the dataset with additional information, explore traffic patterns, identify anomalies, and build accurate models for fraud detection.
Types of data
In the world of security, you'll typically encounter either web server logs or application-level logs.
Web server logs
These are generated by the web server, and represent requests that the web server has received. Each row is one request, and usually includes some basic data about the web server's response.
Here are some sample logs from NGINX, a popular web server:
107.189.10.196 - - [14/Feb/2022:03:48:55 +0000] "POST /HNAP1/ HTTP/1.1" 404 134 "-" "Mozila/5.0"
35.162.122.225 - - [14/Feb/2022:04:11:57 +0000] "GET /.env HTTP/1.1" 404 162 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
45.61.172.7 - - [14/Feb/2022:04:16:54 +0000] "GET /.env HTTP/1.1" 404 197 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
45.61.172.7 - - [14/Feb/2022:04:16:55 +0000] "POST / HTTP/1.1" 405 568 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
45.137.21.134 - - [14/Feb/2022:04:18:57 +0000] "GET /dispatch.asp HTTP/1.1" 404 134 "-" "Mozilla/5.0 (iPad; CPU OS 7_1_2 like Mac OS X; en-US) AppleWebKit/531.5.2 (KHTML, like Gecko) Version/4.0.5 Mobile/8B116 Safari/6531.5.2"
23.95.100.141 - - [14/Feb/2022:04:42:23 +0000] "HEAD / HTTP/1.0" 200 0 "-" "-"
217.138.222.101 - - [14/Feb/2022:07:38:40 +0000] "GET /icons/ubuntu-logo.png HTTP/1.1" 404 197 "http://168.119.119.25/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
217.138.222.101 - - [14/Feb/2022:07:38:42 +0000] "GET /favicon.ico HTTP/1.1" 404 197 "http://168.119.119.25/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
217.138.222.101 - - [14/Feb/2022:07:44:02 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
217.138.222.101 - - [14/Feb/2022:07:44:02 +0000] "GET /icons/ubuntu-logo.png HTTP/1.1" 404 197 "http://168.119.119.25/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"These sample logs contain some common fields. These fields will usually always be present in logs from all major servers, although their order or format may differ from server to server.
- Client IP address - Note that depending on the configuration of the server, this client IP address may not always be accurate.
- Timestamp
- URL - The URL is typically relative to the server root, so / represents a request for the root directory.
- Method - GET and POST requests are most common, but you may also see HEAD, OPTIONS, PUT, and so on. The request body will vary depending on the method type.
- Response code - 200 typically means "okay", 404 "page not found", and 500 "internal server error", but this varies from server to server. Mozilla provides a good summary, and you can also dive into the RFC spec. (Note that not all servers or web applications are configured to follow this spec.)
- Request size
- User agent - This is the user agent string provided by the browser. Note that this is self-declared and can be easily spoofed by attackers, although the vast majority of (legitimate) traffic will have user agent strings that make sense.
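To make these fields easier to work with, you can parse each line into a structured record. Here is a minimal sketch in Python/Pandas, assuming the default NGINX combined log format shown above (the file name access.log is illustrative):

import re
import pandas as pd

# Regex for the default NGINX "combined" log format shown in the samples above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

with open("access.log") as f:  # illustrative file name
    records = [r for r in (parse_line(line) for line in f) if r]

df = pd.DataFrame(records)
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%b/%Y:%H:%M:%S %z")
df[["status", "size"]] = df[["status", "size"]].astype(int)
print(df.head())

From here, each field becomes a proper column that you can filter, group, and plot.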
Application logs
Application logs will contain a great deal more information than simple web traffic logs. They capture useful context about the application itself and allow us to infer what is actually happening.
Because developers have full control over what data to log, application logs vary extensively, and there is no standard format. This makes application logs more challenging to process.
That said, application logs are typically in CSV, JSON, or some form of key=value pairs. Here is a sample antivirus log from Fortinet:
date=2019-05-13 time=11:45:03 logid="0211008192" type="utm" subtype="virus" eventtype="infected" level="warning" vd="vdom1" eventtime=1557773103767393505 msg="File is infected." action="blocked" service="HTTP" sessionid=359260 srcip=10.1.100.11 dstip=172.16.200.55 srcport=60446 dstport=80 srcintf="port12" srcintfrole="undefined" dstintf="port11" dstintfrole="undefined" policyid=4 proto=6 direction="incoming" filename="eicar.com" quarskip="File-was-not-quarantined." virus="EICAR_TEST_FILE" dtype="Virus" ref="http://www.fortinet.com/ve?vn=EICAR_TEST_FILE" virusid=2172 url="http://172.16.200.55/virus/eicar.com" profile="g-default" agent="curl/7.47.0" analyticscksum="275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f" analyticssubmit="false" crscore=50 craction=2 crlevel="critical"

And here is a set of sample settled transactions in JSON from Authorize.net, queryable via an API:
{
"getTransactionListResponse": {
"-xmlns": "AnetApi/xml/v1/schema/AnetApiSchema.xsd",
"messages": {
"resultCode": "Ok",
"message": {
"code": "I00001",
"text": "Successful."
}
},
"transactions": {
"transaction": [
{
"transId": "12345",
"submitTimeUTC": "2009-05-30T09:00:00",
"submitTimeLocal": "2009-05-30T04:00:00",
"transactionStatus": "settledSuccessfully",
"invoice": "INV00001",
"firstName": "John",
"lastName": "Doe",
"amount": "2.00",
"accountType": "Visa",
"accountNumber": "XXXX1111",
"settleAmount": "2.00",
"subscription": {
"id": "145521",
"payNum": "1"
},
"profile": {
"customerProfileId": "1806660050",
"customerPaymentProfileId": "1805324550"
}
},
{
"transId": "12345",
"submitTimeUTC": "2009-05-30T09:00:00",
"submitTimeLocal": "2009-05-30T04:00:00",
"transactionStatus": "settledSuccessfully",
"invoice": "INV00001",
"firstName": "John",
"lastName": "Doe",
"amount": "2.00",
"accountType": "Visa",
"accountNumber": "XXXX1111",
"marketType": "eCommerce",
"product": "Card Not Present",
"mobileDeviceId": "2354578983274523978"
}
]
},
"totalNumInResultSet": "2"
}
}
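Since formats vary so much, you often end up writing a small adapter per source. As a rough sketch (the sample values below are shortened versions of the logs above), key=value logs can be split with a regex, and nested JSON flattens nicely with pandas.json_normalize:

import re
import pandas as pd

# key="quoted value" or key=bare_value pairs, as in the Fortinet sample above.
KV_PATTERN = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def parse_kv_line(line):
    # One dict per log line; keep whichever group (quoted or bare) matched.
    return {key: quoted or bare for key, quoted, bare in KV_PATTERN.findall(line)}

line = 'date=2019-05-13 time=11:45:03 subtype="virus" action="blocked" srcip=10.1.100.11 agent="curl/7.47.0"'
print(parse_kv_line(line))

# Nested JSON (e.g. the Authorize.net transactions above) can be flattened so that
# nested objects such as "profile" become dotted column names.
transactions = [
    {"transId": "12345", "amount": "2.00", "profile": {"customerProfileId": "1806660050"}},
    {"transId": "12345", "amount": "2.00", "marketType": "eCommerce"},
]
df = pd.json_normalize(transactions)
print(df.columns.tolist())  # includes 'profile.customerProfileId'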
Workflow for analyzing log data
This is a typical workflow for analyzing log data, and it should be familiar to experienced data scientists. Your exact steps will depend on your goals. Are you trying to investigate a particular case of fraud? Count the number of real human users in your application? Establish your baseline level of automated traffic? Or are you trying to build a fraud detection model?
Extract, load and transform the data
- This is often known as the "ELT" process (some may call it "ETL" if transformation is done before loading) and involves getting the data from the source to the destination (often a data warehouse). If you are lucky and have a good data engineering team, this is often something that they can perform for you. If you are at a small startup or the data engineering team is booked out, this is often something that you will have to do yourself.
- Either way, the "original source" comes in many formats: CSV, TSV, JSONL (newline-delimited JSON), parquet, ElasticSearch, or even Google Sheets/Excel! The destination depends on data size and whether you intend for it to be reused: If the data is sufficiently small, you can load the CSV directly into Python/Pandas. If you want it to be reused, consider loading the data into Snowflake, BigQuery, Redshift, or your data warehouse of choice (highly recommended).
- If the dataset is huge, consider loading just a small sample (perhaps a few thousand rows) initially. Study the sample, and identify any necessary transformations that you may need to perform. Or if the quality is simply terrible, don't waste your time. Consider asking for a new dataset!
- Cleanliness: Most data scientists spend a huge amount of time wrangling and cleaning data. Here are some common hygiene issues when dealing with web logs:
- Formatting: If the data comes from Google Sheets or Excel, or has otherwise been manually entered, be careful! Unexpected formatting (e.g., integers formatted as strings, or dates formatted as strings or represented as integers) will often throw you off. Also look out for stray whitespace.
- Time zones: What time zone are timestamps shown in? Is it UTC, or server time? Unix timestamps don't carry time zone information; they count seconds from the Unix epoch, which is defined in UTC.
- Multiple files: If you are dealing with multiple data files (especially files generated over time), check if the number of columns and the order of columns in each file match up. Are timestamps shown in a consistent format across all files?
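As a rough illustration of the loading and hygiene steps above, you might sample a large CSV and normalize types and time zones before going any further (the file and column names here are hypothetical):

import pandas as pd

# Load only a small sample of a huge file first (hypothetical file/column names).
df = pd.read_csv("web_logs.csv", nrows=5000, dtype=str)

# Inspect what you actually received before trusting it.
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head(10))  # per-column missing rate

# Strip stray whitespace that often creeps in via Excel/Sheets exports.
df = df.apply(lambda col: col.str.strip())

# Coerce types explicitly; errors="coerce" surfaces bad rows as NaN/NaT instead of crashing.
df["status"] = pd.to_numeric(df["status"], errors="coerce")
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

# If timestamps are naive server-local times, attach the zone explicitly, then convert to UTC:
# df["timestamp"] = df["timestamp"].dt.tz_localize("America/New_York").dt.tz_convert("UTC")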
Enrich the data if necessary
- A common enrichment is to obtain the ASN and country/city for each IP address, or even some IP risk/reputation score (see the sketch after this list).
- Another option is to join each log entry with some application backend data, such as the current account balance for each successful login.
- Understand the data schema, coverage, and cleanliness.
- Schema: Watch out for nested fields if you are dealing with JSON. Not everything can be flattened, but try to flatten as much as possible before you start your analysis.
- Coverage: If the logs represent web traffic, spend some time understanding how the logs are generated and what traffic is actually covered. What endpoints and applications are covered? Do the logs contain only POST data, without any GETs? Are there GET requests only to entry-point pages, or also to non-HTML assets (e.g., images, JavaScript files, CSS files)? Do they only contain successful POSTs? Are you dealing with human-generated traffic, API requests, or a mix of both? Are these all requests, or only requests that passed some anti-bot solution or some geoblocking rules? Are the IP addresses actually end-user IPs, or do you have to extract them from certain request headers (e.g., the X-Forwarded-For header)? Is response data generated by the origin server also included, or only request data? Were any rate limits imposed on the traffic?
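Here is a sketch of the IP enrichment step, assuming you have downloaded the MaxMind GeoLite2 databases locally and installed the geoip2 package (both are assumptions; any IP intelligence source works the same way):

import geoip2.database
import geoip2.errors
import pandas as pd

# Paths to locally downloaded GeoLite2 databases (illustrative).
asn_reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")
city_reader = geoip2.database.Reader("GeoLite2-City.mmdb")

def enrich_ip(ip):
    # Return ASN and geo fields for one IP; leave fields empty if the IP is unknown.
    out = {"asn": None, "asn_org": None, "country": None, "city": None}
    try:
        asn = asn_reader.asn(ip)
        out["asn"] = asn.autonomous_system_number
        out["asn_org"] = asn.autonomous_system_organization
        geo = city_reader.city(ip)
        out["country"] = geo.country.iso_code
        out["city"] = geo.city.name
    except geoip2.errors.AddressNotFoundError:
        pass
    return out

# df is the parsed log DataFrame from earlier, with an "ip" column.
# Look up each distinct IP once, then join the results back to avoid repeated lookups.
unique_ips = df["ip"].drop_duplicates()
enrichment = pd.DataFrame([enrich_ip(ip) for ip in unique_ips], index=unique_ips)
df = df.join(enrichment, on="ip")

Country, city, and ASN then become dimensions you can aggregate on in the exploration step.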
Explore the dataset
- Log data is always timestamped, so aggregate the data along various dimensions and visualize traffic from each dimension in a series of time-series charts. Look at the charts at different resolutions: bucketed by day, by hour, by 10 minutes, and by minute. For most applications, traffic should be higher during weekdays and lower during weekends and public holidays. Also, if your application has expected traffic spikes, check that they are present. (Expected traffic spikes might include Nordstrom performing a shoe drop, a lottery website holding drawings on fixed dates, or an unemployment claims website where claims must be submitted by a certain deadline.)
- You should also compute useful metrics for different subsets of data. If you need inspiration, a good starting point would be the sample features outlined in the course notes.
- Remember that exploration can take forever, so set aside a reasonable amount of time for it before diving in! Focus on exploration that will move you toward your analytics goal.
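A small sketch of this kind of time-series slicing with pandas, assuming the parsed and enriched DataFrame from earlier (with timestamp, status, and ip columns):

# Assumes df has a parsed "timestamp" column plus "ip" and "status".
ts = df.set_index("timestamp").sort_index()

# Request volume at several resolutions; plot each and eyeball the weekday/weekend rhythm.
daily = ts.resample("1D").size()
hourly = ts.resample("1h").size()
per_10min = ts.resample("10min").size()

# The same idea per dimension, e.g. error rate and distinct IPs per hour.
hourly_error_rate = ts["status"].ge(400).resample("1h").mean()
hourly_unique_ips = ts["ip"].resample("1h").nunique()

hourly.plot(title="Requests per hour")  # requires matplotlib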
Identify anomalies
- For the traffic as a whole, look at the top 10-20 ASNs, UAs, IP addresses, countries, URLs, and other application-specific data (such as usernames). Is there anything unexpected? Identify strange user agents, traffic from unexpected countries or ASNs, etc.
- Are the traffic volumes to different endpoints expected? For example, if your application requires a login and has a short session length (e.g., a banking website), there should be much less traffic for any post-login endpoint than the login endpoint itself.
- For each endpoint, are there any unexpected referers? For example, if you only have two entry points for your "forgot password" endpoint, you should not expect to see much traffic with referers such as www.google.com or http://localhost:5000. (The latter especially would indicate a script running off someone's local machine!) Also, if your endpoint is expecting some traffic from an entry point on the same website, it would be suspicious to see requests with a blank referer. Such traffic is often sent directly from a script, without prior navigation or human interaction.
- Pay attention to every anomaly that you observe. Just like good software engineers, attackers often run small-scale tests as they develop their scripts, before deploying them into "production." For example, you might see a carelessly written UA string, and that might be a prelude to a significant attack from the same IP address with properly written UA strings.
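Simple value counts go a long way for the checks above. A sketch, again assuming the enriched DataFrame from earlier (with asn, user_agent, ip, country, url, and referer columns; the /forgot-password path is hypothetical):

# Top talkers along a few dimensions; unexpected entries here are your first leads.
for col in ["asn", "user_agent", "ip", "country", "url"]:
    print(df[col].value_counts().head(20), "\n")

# Referers seen on a sensitive endpoint.
forgot_pw = df[df["url"].str.startswith("/forgot-password")]
print(forgot_pw["referer"].value_counts(dropna=False).head(20))

# Blank referers on an endpoint that normally follows on-site navigation are suspicious.
blank = forgot_pw["referer"].isna() | forgot_pw["referer"].isin(["-", ""])
print("Blank referer rate: {:.1%}".format(blank.mean()))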
Additional resources
If you are new to exploring large datasets, here are some useful resources that will help you get started. These tutorials focus on exploring data, not the ELT process or loading data; however, as a good data scientist, you should be comfortable working with data end-to-end.
- https://www.digitalocean.com/community/tutorials/exploratory-data-analysis-python (short 5-min read, with some useful steps highlighted)
- https://www.kaggle.com/code/ekami66/detailed-exploratory-data-analysis-with-python/notebook (example is geared towards broader datasets, i.e. those with more fields/variables)
- https://www.datacamp.com/tutorial/exploratory-data-analysis-python (clean and straightforward tutorial)
- https://opensource.com/article/19/5/visualize-log-data-apache-spark (this is quite relevant as it features web log data, although the tutorial uses Spark, not Pandas)
- https://blog.networktocode.com/post/introduction-to-pandas-for-network-development/ (this features network traffic, not your typical server logs – but still really useful)
Building models
If you are building a model, consider:
- What is the objective of your model – What is the model trying to predict?
- Do you want to build a simple rules-based model, a supervised ML model, or an unsupervised model? Rules-based models are often easier to get started with and are good for detecting adversarial traffic that has not been blocked. They are also often directly explainable. ML models are more durable and generalizable, but they require more upfront effort. Rules-based models often perform similarly to overfitted ML models, in that they are hypersensitive to small changes in feature values and hence do not generalize well. If you are using ML, be aware that fraud datasets are often extremely imbalanced and will require some corrective preparation (such as oversampling fraud or generating synthetic training data using something like SMOTE). You'll also need to compute your model performance correctly (see the sketch after this list).
- How will the model be deployed? If your model is going to be used strictly for offline analysis and prediction (e.g., once a day), then you can usually rely on computing features directly from the data warehouse. If you need to deploy the model in real time (with a sub-100ms response time), then consider carefully how your features will be computed. You may need to hold some data in low-latency storage or explore data systems such as graph databases.
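Here is a rough sketch of handling class imbalance and evaluating the result, using scikit-learn and imbalanced-learn's SMOTE on synthetic data (the model choice and parameters are illustrative, not a recommendation):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled fraud dataset: roughly 1% positive (fraud) class.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample fraud in the training split only; never resample the test set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_res, y_res)

# Accuracy is meaningless at ~1% fraud; look at precision/recall and PR-AUC instead.
scores = clf.predict_proba(X_test)[:, 1]
print(classification_report(y_test, clf.predict(X_test), digits=3))
print("PR-AUC:", average_precision_score(y_test, scores))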
Other tips
Process data as close to the source as possible
Your fraud data science workflow will vary depending on the types of problems that you are trying to solve, as well as the systems that you are working with. However, a good rule of thumb is to process data as close to the source as possible. This minimizes processing time. Many data scientists try to wrangle a large amount of data within Python and Pandas, and they find themselves limited by the fact that Pandas does not scale very well across multiple CPUs. For example, at MetaMap, our training data is often prepared directly within our data warehouse (Google BigQuery) using SQL.
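For example, here is a sketch of pushing the heavy aggregation into BigQuery and pulling back only the result, using the google-cloud-bigquery client (the project, table, and column names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials are configured

# Aggregate per IP inside the warehouse; only the small result set comes back to pandas.
query = """
    SELECT
      ip,
      COUNT(*) AS requests,
      COUNTIF(status >= 400) AS errors,
      COUNT(DISTINCT user_agent) AS distinct_uas
    FROM `my_project.weblogs.requests`  -- hypothetical table
    WHERE DATE(timestamp) = CURRENT_DATE()
    GROUP BY ip
"""
features = client.query(query).to_dataframe()

The warehouse does the scan and group-by; pandas only ever sees one row per IP.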
Use external data
This is something that most security/fraud data scientists do not do enough. Attackers thrive on defenders working within their own silos. Oftentimes, an attacker will be committing fraud not only on your application but on other similar applications from your competitors as well. Reach out to them in the name of security, and see if you can establish an anonymous information-sharing platform to alert each other of attack tactics, techniques, and procedures (TTPs). Vendors will also often sell you such data, but it is often delayed and imprecise.
Outsource fraud prevention
There is a case for outsourcing aspects of fraud prevention to specialist vendors. However, successful outsourcing requires a strong partnership between the vendor and the fraud/analytics teams of the client. Outsourcing is most effective when clients already have a strong understanding of their own data, as it is impossible for vendors to obtain and work with 100% of the data that the client has access to!