Fraud is an unfortunate but very real part of digital marketing and advertising, and no one who has worked in the industry for any decent length of time is a stranger to this.
The digital marketing industry, and video marketing specifically, had an extraordinary scare in late 2016, when digital advertising security company White Ops discovered that hackers had faked IP addresses to defraud video advertisers out of millions of dollars. The hack was dubbed the Methbot scheme and took place during the 2016 holiday season, when the cyber forgers made advertisers unknowingly pay up to $5 million a day for video views that weren't actually happening. The attack carefully evaded many of the anti-fraud mechanisms that advertisers had put in place.
At Eyeview, we have sophisticated techniques and algorithms to battle fraud and ensure our clients have access to the most valuable media. We ultimately weren't affected by Methbot, but given the scale of the attack, we decided to investigate the situation with more scrutiny and make sure nothing similar could get past our safeguards in the future.
As part of the analytics team, my part of the investigation was ensuring that we were not serving on the fraudulent IP addresses. I was able to draw on techniques and tools already in my wheelhouse, while learning some new ones, to tackle the specific use case of reviewing IP addresses across multiple executions and millions of served ads.
My investigation into the Methbot scheme and its potential impact on our clients relied on the integrated data systems we have at Eyeview. Ideally, I would have imported, cleaned and analyzed the data with Python on my local machine, but since the data lived in Amazon S3, I couldn't do it all locally in Python; I had to move my data across different systems.
The challenge was figuring out how to extract the IP addresses, which were in a format I had never seen before, and join that file, close to 1 million rows when fully extracted, to our ad-serving data. First I had to plan how I was going to clean the data, meaning formatting it in a way that could be used for analysis. For example, my data had commas and hyphens at the end of every entry, and I had to eliminate them before doing any analysis. Next, I had to import the data; again, if the data wasn't clean, I wouldn't be able to import it for analysis. Lastly, I planned to analyze the data. So my plan became: clean the data using Python, import it via FTP and analyze it via our Spark Service Provider.
I learned about Python packages for parsing IP addresses, and from there I wrote a Python script to clean up and extract the IP address list. In this case, I ran Python on my local machine for the extraction, but if this is an ongoing process where you need to move data across systems, consider running the Python code in a Spark notebook or a similar environment.
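The published Methbot list mixed single addresses with ranges, and entries ended in stray commas and hyphens. The exact file format isn't reproduced here, so the sample entries and helper below are illustrative, but a cleaning script along these lines can be sketched with Python's standard-library ipaddress module:

```python
import ipaddress

# Hypothetical raw entries in the spirit of the flagged-IP file:
# addresses, CIDR blocks and dashed ranges with trailing commas/hyphens.
raw_lines = [
    "192.0.2.10,",
    "198.51.100.0/24,-",
    "203.0.113.5 - 203.0.113.9,",
    "not-an-ip,",
]

def clean_ips(lines):
    """Strip trailing punctuation and expand each entry into individual IPs."""
    ips = set()
    for line in lines:
        entry = line.strip().rstrip(",-").strip()
        try:
            if "/" in entry:
                # CIDR block, e.g. 198.51.100.0/24 -> its 254 host addresses
                ips.update(str(ip) for ip in ipaddress.ip_network(entry).hosts())
            elif "-" in entry:
                # dashed range, e.g. "a - b" -> every address from a to b
                start, end = (p.strip() for p in entry.split("-"))
                lo = int(ipaddress.ip_address(start))
                hi = int(ipaddress.ip_address(end))
                ips.update(str(ipaddress.ip_address(n)) for n in range(lo, hi + 1))
            else:
                ips.add(str(ipaddress.ip_address(entry)))  # single address
        except ValueError:
            continue  # skip malformed entries rather than abort the run
    return ips

cleaned = clean_ips(raw_lines)
```

The ValueError catch does double duty: ipaddress raises it for invalid addresses, and the two-way unpack raises it for entries with extra hyphens, so junk lines fall out without special-casing.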
Managing the Data
Next, I had to move this data somewhere I could run a query against a large dataset. I moved it to Amazon S3 (our cloud storage) because simply copying and pasting the IP addresses from my local machine into a Spark Service Provider wouldn't work given the memory limitations. I used Spark to access the list of IP addresses in S3 and ran a few functions to turn it into a dataframe. I then used Spark SQL to query the IP address dataframe alongside our ad-serving data in Amazon S3, using standard queries, subqueries and WITH clauses.
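I can't share our ad-serving schema, so the table and column names below are hypothetical, and SQLite stands in for Spark SQL over S3-backed dataframes. The query shape is the same, though: a WITH clause over the flagged-IP list joined against served impressions, where any returned rows would mean we had served on a fraudulent address.

```python
import sqlite3

# SQLite as a local stand-in for Spark SQL; in production the two tables
# would be dataframes registered from S3. Schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flagged_ips (ip TEXT PRIMARY KEY);
    CREATE TABLE ad_serving (impression_id INTEGER, ip TEXT, campaign TEXT);
""")
conn.executemany("INSERT INTO flagged_ips VALUES (?)",
                 [("198.51.100.7",), ("203.0.113.5",)])
conn.executemany("INSERT INTO ad_serving VALUES (?, ?, ?)",
                 [(1, "192.0.2.10", "brand_a"),
                  (2, "192.0.2.44", "brand_b"),
                  (3, "192.0.2.191", "brand_a")])

# WITH clause + join: rows returned here would be impressions served
# on flagged IPs. An empty result is the outcome you want.
rows = conn.execute("""
    WITH flagged AS (SELECT ip FROM flagged_ips)
    SELECT a.impression_id, a.ip, a.campaign
    FROM ad_serving a
    JOIN flagged f ON a.ip = f.ip
""").fetchall()
```

With the sample data above the join comes back empty, which mirrors the finding in our investigation: no impressions served on flagged addresses.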
There are many ways to query the data and many languages you can use for analysis within a good Spark Service Provider; base your decision on a combination of what is most efficient, what you're most comfortable with, and what uses the fewest tasks or clusters. By executing this entire process, I was able to confirm that we were not serving on fraudulent IP addresses: our clients hadn't been affected by the Methbot scheme and wouldn't be affected by a similar issue in the future.
It is always important to devise a plan from start to finish when analyzing data, and to understand how your systems work with each other. My plan for tackling this issue was to clean the IP address file, then import that file to a Spark Service Provider so I could run the analysis against our ad-serving data. It was challenging to figure out how to clean this data, since I had never worked with IP addresses in this format before, and to figure out how to process data this large in an efficient manner. A key to being a good data analyst is knowing whether you can move data across different systems, and if you can't, knowing how to manipulate the data so you can. At Eyeview, we understand how important this is, which makes us well-equipped to handle potential fraud issues in the future.