Analyzing web application data
Learn how to effectively analyze and clean web application data, ensure consistency across logs, enrich the dataset with additional information, explore traffic patterns, identify anomalies, and build accurate models for fraud detection.
Types of data
In the world of security, you'll typically encounter either web server logs or application-level logs.
Web server logs
These are generated by the web server, and represent requests that the web server has received. Each row is one request, and usually includes some basic data about the web server's response.
Here are some sample logs from NGINX, a popular web server:
107.189.10.196 - - [14/Feb/2022:03:48:55 +0000] "POST /HNAP1/ HTTP/1.1" 404 134 "-" "Mozila/5.0"
35.162.122.225 - - [14/Feb/2022:04:11:57 +0000] "GET /.env HTTP/1.1" 404 162 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
45.61.172.7 - - [14/Feb/2022:04:16:54 +0000] "GET /.env HTTP/1.1" 404 197 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
45.61.172.7 - - [14/Feb/2022:04:16:55 +0000] "POST / HTTP/1.1" 405 568 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
45.137.21.134 - - [14/Feb/2022:04:18:57 +0000] "GET /dispatch.asp HTTP/1.1" 404 134 "-" "Mozilla/5.0 (iPad; CPU OS 7_1_2 like Mac OS X; en-US) AppleWebKit/531.5.2 (KHTML, like Gecko) Version/4.0.5 Mobile/8B116 Safari/6531.5.2"
23.95.100.141 - - [14/Feb/2022:04:42:23 +0000] "HEAD / HTTP/1.0" 200 0 "-" "-"
217.138.222.101 - - [14/Feb/2022:07:38:40 +0000] "GET /icons/ubuntu-logo.png HTTP/1.1" 404 197 "http://168.119.119.25/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
217.138.222.101 - - [14/Feb/2022:07:38:42 +0000] "GET /favicon.ico HTTP/1.1" 404 197 "http://168.119.119.25/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
217.138.222.101 - - [14/Feb/2022:07:44:02 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
217.138.222.101 - - [14/Feb/2022:07:44:02 +0000] "GET /icons/ubuntu-logo.png HTTP/1.1" 404 197 "http://168.119.119.25/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"These sample logs contain some common fields. These fields will usually always be present in logs from all major servers, although their order or format may differ from server to server.
- Client IP address - Note that depending on the configuration of the server, this client IP address may not always be accurate.
- Timestamp
- URL - The URL is typically relative to the server root, so / represents a request for the root directory.
- Method - GET and POST requests are most common, but you may also see HEAD, OPTIONS, PUT, and so on. The request body will vary depending on the method type.
- Response code - 200 typically means "okay", 404 "page not found", and 500 "internal server error", but this varies from server to server. Mozilla provides a good summary, and you can also dive into the RFC spec. (Note that not all servers or web applications are configured to follow this spec.)
- Request size
- User agent - This is the user agent string provided by the browser. Note that this is self-declared and can be easily spoofed by attackers, although the vast majority of (legitimate) traffic will have user agent strings that make sense.
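To make these fields easier to work with, you can parse each line into a structured record. Here is a minimal sketch in Python/Pandas, assuming the default NGINX combined log format shown above (the file name access.log is illustrative):

import re
import pandas as pd

# Regex for the default NGINX "combined" log format shown in the samples above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

with open("access.log") as f:  # illustrative file name
    records = [r for r in (parse_line(line) for line in f) if r]

df = pd.DataFrame(records)
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%b/%Y:%H:%M:%S %z")
df[["status", "size"]] = df[["status", "size"]].astype(int)
print(df.head())

From here, each field becomes a proper column that you can filter, group, and plot.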
Application logs
Application logs will contain a great deal more information than simple web traffic logs. They capture useful context about the application itself and allow us to infer what is actually happening.
Because developers have full control over what data to log, application logs vary extensively, and there is no standard format. This makes application logs more challenging to process.
That said, application logs are typically in CSV, JSON, or some form of key=value pairs. Here is a sample antivirus log from Fortinet:
date=2019-05-13 time=11:45:03 logid="0211008192" type="utm" subtype="virus" eventtype="infected" level="warning" vd="vdom1" eventtime=1557773103767393505 msg="File is infected." action="blocked" service="HTTP" sessionid=359260 srcip=10.1.100.11 dstip=172.16.200.55 srcport=60446 dstport=80 srcintf="port12" srcintfrole="undefined" dstintf="port11" dstintfrole="undefined" policyid=4 proto=6 direction="incoming" filename="eicar.com" quarskip="File-was-not-quarantined." virus="EICAR_TEST_FILE" dtype="Virus" ref="http://www.fortinet.com/ve?vn=EICAR_TEST_FILE" virusid=2172 url="http://172.16.200.55/virus/eicar.com" profile="g-default" agent="curl/7.47.0" analyticscksum="275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f" analyticssubmit="false" crscore=50 craction=2 crlevel="critical"

And here is a set of sample settled transactions in JSON from Authorize.net, queryable via an API:
{
"getTransactionListResponse": {
"-xmlns": "AnetApi/xml/v1/schema/AnetApiSchema.xsd",
"messages": {
"resultCode": "Ok",
"message": {
"code": "I00001",
"text": "Successful."
}
},
"transactions": {
"transaction": [
{
"transId": "12345",
"submitTimeUTC": "2009-05-30T09:00:00",
"submitTimeLocal": "2009-05-30T04:00:00",
"transactionStatus": "settledSuccessfully",
"invoice": "INV00001",
"firstName": "John",
"lastName": "Doe",
"amount": "2.00",
"accountType": "Visa",
"accountNumber": "XXXX1111",
"settleAmount": "2.00",
"subscription": {
"id": "145521",
"payNum": "1"
},
"profile": {
"customerProfileId": "1806660050",
"customerPaymentProfileId": "1805324550"
}
},
{
"transId": "12345",
"submitTimeUTC": "2009-05-30T09:00:00",
"submitTimeLocal": "2009-05-30T04:00:00",
"transactionStatus": "settledSuccessfully",
"invoice": "INV00001",
"firstName": "John",
"lastName": "Doe",
"amount": "2.00",
"accountType": "Visa",
"accountNumber": "XXXX1111",
"marketType": "eCommerce",
"product": "Card Not Present",
"mobileDeviceId": "2354578983274523978"
}
]
},
"totalNumInResultSet": "2"
}
}
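Since formats vary so much, you often end up writing a small adapter per source. As a rough sketch (the sample values below are shortened versions of the logs above), key=value logs can be split with a regex, and nested JSON flattens nicely with pandas.json_normalize:

import re
import pandas as pd

# key="quoted value" or key=bare_value pairs, as in the Fortinet sample above.
KV_PATTERN = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def parse_kv_line(line):
    # One dict per log line; keep whichever group (quoted or bare) matched.
    return {key: quoted or bare for key, quoted, bare in KV_PATTERN.findall(line)}

line = 'date=2019-05-13 time=11:45:03 subtype="virus" action="blocked" srcip=10.1.100.11 agent="curl/7.47.0"'
print(parse_kv_line(line))

# Nested JSON (e.g. the Authorize.net transactions above) can be flattened so that
# nested objects such as "profile" become dotted column names.
transactions = [
    {"transId": "12345", "amount": "2.00", "profile": {"customerProfileId": "1806660050"}},
    {"transId": "12345", "amount": "2.00", "marketType": "eCommerce"},
]
df = pd.json_normalize(transactions)
print(df.columns.tolist())  # includes 'profile.customerProfileId'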
Workflow for analyzing log data
This is a typical workflow for analyzing log data, and it should be familiar to experienced data scientists. Your exact steps will depend on your goals. Are you trying to investigate a particular case of fraud? Count the number of real human users in your application? Establish your baseline level of automated traffic? Or are you trying to build a fraud detection model?
Extract, load and transform the data
- This is often known as the "ELT" process (some may call it "ETL" if transformation is done before loading) and involves getting the data from the source to the destination (often a data warehouse). If you are lucky and have a good data engineering team, this is often something that they can perform for you. If you are at a small startup or the data engineering team is booked out, this is often something that you will have to do yourself.
- Either way, the "original source" comes in many formats: CSV, TSV, JSONL (newline-delimited JSON), parquet, ElasticSearch, or even Google Sheets/Excel! The destination depends on data size and whether you intend for it to be reused: If the data is sufficiently small, you can load the CSV directly into Python/Pandas. If you want it to be reused, consider loading the data into Snowflake, BigQuery, Redshift, or your data warehouse of choice (highly recommended).
- If the dataset is huge, consider loading just a small sample (perhaps a few thousand rows) initially. Study the sample, and identify any necessary transformations that you may need to perform. Or if the quality is simply terrible, don't waste your time. Consider asking for a new dataset!
- Cleanliness: Most data scientists spend a huge amount of time wrangling and cleaning data. Here are some common hygiene issues when dealing with web logs:
- Formatting: If the data comes from Google Sheets or Excel, or has otherwise been manually entered, be careful! Unexpected formatting (e.g., integers formatted as strings, or dates formatted as strings or represented as integers) will often throw you off. Also look out for stray whitespace.
- Time zones: What time zone are timestamps shown in? Is it UTC, or server time? Unix timestamps don't carry time zone information; they count seconds from the Unix epoch, which is defined in UTC.
- Multiple files: If you are dealing with multiple data files (especially files generated over time), check if the number of columns and the order of columns in each file match up. Are timestamps shown in a consistent format across all files?
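As a rough illustration of the loading and hygiene steps above, you might sample a large CSV and normalize types and time zones before going any further (the file and column names here are hypothetical):

import pandas as pd

# Load only a small sample of a huge file first (hypothetical file/column names).
df = pd.read_csv("web_logs.csv", nrows=5000, dtype=str)

# Inspect what you actually received before trusting it.
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head(10))  # per-column missing rate

# Strip stray whitespace that often creeps in via Excel/Sheets exports.
df = df.apply(lambda col: col.str.strip())

# Coerce types explicitly; errors="coerce" surfaces bad rows as NaN/NaT instead of crashing.
df["status"] = pd.to_numeric(df["status"], errors="coerce")
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

# If timestamps are naive server-local times, attach the zone explicitly, then convert to UTC:
# df["timestamp"] = df["timestamp"].dt.tz_localize("America/New_York").dt.tz_convert("UTC")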
Enrich the data if necessary
- A common enrichment is to obtain the ASN and country/city for each IP address, or even some IP risk/reputation score (see the sketch after this list).
- Another option is to join each log entry with some application backend data, such as the current account balance for each successful login.
- Understand the data schema, coverage, and cleanliness.
- Schema: Watch out for nested fields if you are dealing with JSON. Not everything can be flattened, but try to flatten as much as possible before you start your analysis.
- Coverage: If the logs represent web traffic, spend some time understanding how the logs are generated and what traffic is actually covered. What endpoints and applications are covered? Do the logs contain only POST data, without any GETs? Are there GET requests only to entry-point pages, or also to non-HTML assets (e.g., images, JavaScript files, CSS files)? Do they only contain successful POSTs? Are you dealing with human-generated traffic, API requests, or a mix of both? Are these all requests, or only requests that passed some anti-bot solution or some geoblocking rules? Are the IP addresses actually end-user IPs, or do you have to extract them from certain request headers (e.g., the X-Forwarded-For header)? Is response data generated by the origin server also included, or only request data? Were any rate limits imposed on the traffic?
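Here is a sketch of the IP enrichment step, assuming you have downloaded the MaxMind GeoLite2 databases locally and installed the geoip2 package (both are assumptions; any IP intelligence source works the same way):

import geoip2.database
import geoip2.errors
import pandas as pd

# Paths to locally downloaded GeoLite2 databases (illustrative).
asn_reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")
city_reader = geoip2.database.Reader("GeoLite2-City.mmdb")

def enrich_ip(ip):
    # Return ASN and geo fields for one IP; leave fields empty if the IP is unknown.
    out = {"asn": None, "asn_org": None, "country": None, "city": None}
    try:
        asn = asn_reader.asn(ip)
        out["asn"] = asn.autonomous_system_number
        out["asn_org"] = asn.autonomous_system_organization
        geo = city_reader.city(ip)
        out["country"] = geo.country.iso_code
        out["city"] = geo.city.name
    except geoip2.errors.AddressNotFoundError:
        pass
    return out

# df is the parsed log DataFrame from earlier, with an "ip" column.
# Look up each distinct IP once, then join the results back to avoid repeated lookups.
unique_ips = df["ip"].drop_duplicates()
enrichment = pd.DataFrame([enrich_ip(ip) for ip in unique_ips], index=unique_ips)
df = df.join(enrichment, on="ip")

Country, city, and ASN then become dimensions you can aggregate on in the exploration step.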
Explore the dataset
- Log data is always timestamped, so aggregate the data along various dimensions and visualize traffic from each dimension in a series of time-series charts. Look at the charts at different resolutions: bucketed by day, by hour, by 10 minutes, and by minute. For most applications, traffic should be higher during weekdays and lower during weekends and public holidays. Also, if your application has expected traffic spikes, check that they are present. (Expected traffic spikes might include Nordstrom performing a shoe drop, a lottery website holding drawings on fixed dates, or an unemployment claims website where claims must be submitted by a certain deadline.)
- You should also compute useful metrics for different subsets of data. If you need inspiration, a good starting point would be the sample features outlined in the course notes.
- Remember that exploration can take forever, so set aside a reasonable amount of time for it before diving in! Focus on exploration that will move you toward your analytics goal.
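A small sketch of this kind of time-series slicing with pandas, assuming the parsed and enriched DataFrame from earlier (with timestamp, status, and ip columns):

# Assumes df has a parsed "timestamp" column plus "ip" and "status".
ts = df.set_index("timestamp").sort_index()

# Request volume at several resolutions; plot each and eyeball the weekday/weekend rhythm.
daily = ts.resample("1D").size()
hourly = ts.resample("1h").size()
per_10min = ts.resample("10min").size()

# The same idea per dimension, e.g. error rate and distinct IPs per hour.
hourly_error_rate = ts["status"].ge(400).resample("1h").mean()
hourly_unique_ips = ts["ip"].resample("1h").nunique()

hourly.plot(title="Requests per hour")  # requires matplotlib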
Identify anomalies
- For the traffic as a whole, look at the top 10-20 ASNs, UAs, IP addresses, countries, URLs, and other application-specific data (such as usernames). Is there anything unexpected? Identify strange user agents, traffic from unexpected countries or ASNs, etc.
- Are the traffic volumes to different endpoints expected? For example, if your application requires a login and has a short session length (e.g., a banking website), there should be much less traffic for any post-login endpoint than the login endpoint itself.
- For each endpoint, are there any unexpected referers? For example, if you only have two entry points for your "forgot password" endpoint, you should not expect to see much traffic with referers such as www.google.com or http://localhost:5000. (The latter especially would indicate a script running off someone's local machine!) Also, if your endpoint is expecting some traffic from an entry point on the same website, it would be suspicious to see requests with a blank referer. Such traffic is often sent directly from a script, without prior navigation or human interaction.
- Pay attention to every anomaly that you observe. Just like good software engineers, attackers often run small-scale tests as they develop their scripts, before deploying them into "production." For example, you might see a carelessly written UA string, and that might be a prelude to a significant attack from the same IP address with properly written UA strings.
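Simple value counts go a long way for the checks above. A sketch, again assuming the enriched DataFrame from earlier (with asn, user_agent, ip, country, url, and referer columns; the /forgot-password path is hypothetical):

# Top talkers along a few dimensions; unexpected entries here are your first leads.
for col in ["asn", "user_agent", "ip", "country", "url"]:
    print(df[col].value_counts().head(20), "\n")

# Referers seen on a sensitive endpoint.
forgot_pw = df[df["url"].str.startswith("/forgot-password")]
print(forgot_pw["referer"].value_counts(dropna=False).head(20))

# Blank referers on an endpoint that normally follows on-site navigation are suspicious.
blank = forgot_pw["referer"].isna() | forgot_pw["referer"].isin(["-", ""])
print("Blank referer rate: {:.1%}".format(blank.mean()))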
Additional resources
If you are new to exploring large datasets, here are some useful resources that will help you get started. These tutorials focus on exploring data, not the ELT process or loading data; however, as a good data scientist, you should be comfortable working with data end-to-end.
- https://www.digitalocean.com/community/tutorials/exploratory-data-analysis-python (short 5-min read, with some useful steps highlighted)
- https://www.kaggle.com/code/ekami66/detailed-exploratory-data-analysis-with-python/notebook (example is geared towards broader datasets, i.e. those with more fields/variables)
- https://www.datacamp.com/tutorial/exploratory-data-analysis-python (clean and straightforward tutorial)
- https://opensource.com/article/19/5/visualize-log-data-apache-spark (this is quite relevant as it features web log data, although the tutorial uses Spark, not Pandas)
- https://blog.networktocode.com/post/introduction-to-pandas-for-network-development/ (this features network traffic, not your typical server logs – but still really useful)
Building models
If you are building a model, consider:
- What is the objective of your model – What is the model trying to predict?
- Do you want to build a simple rules-based model, a supervised ML model, or an unsupervised model? Rules-based models are often easier to get started with and are good for detecting adversarial traffic that has not been blocked. They are also often directly explainable. ML models are more durable and generalizable, but they require more upfront effort. Rules-based models often perform similarly to overfitted ML models, in that they are hypersensitive to small changes in feature values and hence do not generalize well. If you are using ML, be aware that fraud datasets are often extremely imbalanced and will require some corrective preparation (such as oversampling fraud or generating synthetic training data using something like SMOTE). You'll also need to compute your model performance correctly (see the sketch after this list).
- How will the model be deployed? If your model is going to be used strictly for offline analysis and prediction (e.g., once a day), then you can usually rely on computing features directly from the data warehouse. If you need to deploy the model in real time (with a sub-100ms response time), then consider carefully how your features will be computed. You may need to hold some data in low-latency storage or explore data systems such as graph databases.
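Here is a rough sketch of handling class imbalance and evaluating the result, using scikit-learn and imbalanced-learn's SMOTE on synthetic data (the model choice and parameters are illustrative, not a recommendation):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled fraud dataset: roughly 1% positive (fraud) class.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample fraud in the training split only; never resample the test set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_res, y_res)

# Accuracy is meaningless at ~1% fraud; look at precision/recall and PR-AUC instead.
scores = clf.predict_proba(X_test)[:, 1]
print(classification_report(y_test, clf.predict(X_test), digits=3))
print("PR-AUC:", average_precision_score(y_test, scores))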
Other tips
Process data as close to the source as possible
Your fraud data science workflow will vary depending on the types of problems that you are trying to solve, as well as the systems that you are working with. However, a good rule of thumb is to process data as close to the source as possible. This minimizes processing time. Many data scientists try to wrangle a large amount of data within Python and Pandas, and they find themselves limited by the fact that Pandas does not scale very well across multiple CPUs. For example, at MetaMap, our training data is often prepared directly within our data warehouse (Google BigQuery) using SQL.
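For example, here is a sketch of pushing the heavy aggregation into BigQuery and pulling back only the result, using the google-cloud-bigquery client (the project, table, and column names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials are configured

# Aggregate per IP inside the warehouse; only the small result set comes back to pandas.
query = """
    SELECT
      ip,
      COUNT(*) AS requests,
      COUNTIF(status >= 400) AS errors,
      COUNT(DISTINCT user_agent) AS distinct_uas
    FROM `my_project.weblogs.requests`  -- hypothetical table
    WHERE DATE(timestamp) = CURRENT_DATE()
    GROUP BY ip
"""
features = client.query(query).to_dataframe()

The warehouse does the scan and group-by; pandas only ever sees one row per IP.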
Use external data
This is something that most security/fraud data scientists do not do enough. Attackers thrive on defenders working within their own silos. Oftentimes, an attacker will be committing fraud not only on your application but on other similar applications from your competitors as well. Reach out to them in the name of security, and see if you can establish an anonymous information-sharing platform to alert each other of attack tactics, techniques, and procedures (TTPs). Vendors will also often sell you such data, but it is often delayed and imprecise.
Outsource fraud prevention
There is a case for outsourcing aspects of fraud prevention to specialist vendors. However, successful outsourcing requires a strong partnership between the vendor and the fraud/analytics teams of the client. Outsourcing is most effective when clients already have a strong understanding of their own data, as it is impossible for vendors to obtain and work with 100% of the data that the client has access to!