Unraveling World Cup 2019 - A Data Story

Posted on Aug 16, 2019 by Hrishikesh Joshi, Ashish Mohite in AI, Media, Entertainment


Let us start by briefly explaining the business problem in the Indian media industry that we intend to tackle. There are four major stakeholders: Advertisers, Media Agencies, Broadcasters and the Broadcast Audience Research Council (BARC), the central rating agency.

Advertisers hire Media Agencies to make an optimized media plan for TV/Digital/Print/Radio. In our case we focus only on TV, which alone contributes 44% of the overall media industry (“A billion screens of opportunity”, a report by FICCI-EY, 2019). During the campaign planning phase, many media agencies rely on cognitive bias rather than on data. The reason is that viewership data is made available only post-airing, so there is no way of knowing how content will perform before it actually airs.

Using advanced Machine Learning algorithms, we try to estimate this performance well in advance, which helps the agencies make better planning decisions. To read more about this, please refer to our previous blog: India Media Industry, its problems and scope for AI

In India, cricket is not just a sport; it’s a sentiment that unites the nation. That means a lot of TV viewership, which naturally creates an opportunity for a lot of advertisers and agencies. Keeping this in mind, we took it upon ourselves to make sure that these advertisers and agencies could take the most informed decisions for ICC World Cup 2019.


There are two deliverables that we offered:

  1. Dashboard
  2. Report

The dashboard helps users derive their own conclusions from the data. The data is visualized in such a way that it becomes hassle-free for them to communicate their own data story. The dashboard also gives users a perspective on the various factors that affect the viewership of a match. These factors can be classified into two broad categories:

  • Known Factors
  • Unknown Factors

Known factors are those that are more factual than derived, like the time of the match, the day of the week, the venue etc.

Unknown factors are those that are derived from our understanding of the data, like match dynamics. These are the tricky ones.

The second part of the deliverable is the report. This is basically a platter of insights put together with the sole purpose of prompting a call to action. Let’s go through some of the interesting findings and how they affect our goal of projecting match-by-match viewership of World Cup 2019.

ICC T20 World Cup 2016 alone accounted for 13% of the total cricket viewership since October 2015, which makes the Cricket World Cup the most-viewed sports event in India. The T20 format garners 1.4x the viewership of ODI or Test. Matches where India is playing garner 15x the viewership of matches where India is not playing. Also, matches in India garner 5x the viewership of those happening outside India. Of course, there are other elements at play, but looking at these numbers in isolation, we understand that had the World Cup been held in India it would have rated higher than in England.

Factors like telecast channel, hour of the day and day of the week are amongst the important “known factors” in determining the final viewership.

We observed that there is a stark correlation between the TV viewership numbers and Google Trends data.

For Google Trends we took into account the team name searches in India. We see a spike in searches for the Indian National Cricket Team and the Pakistan National Cricket Team during ICC Champions Trophy Match 4, which shows the highest spike in terms of both viewership and Google Trends.

The other, more important parameter in determining the viewership is the match dynamics. In order to estimate the match dynamics, we first need a good grasp of team composition and performance. To understand this, we plotted the individual player performances on 3 scatter plots, viz. Batting Matrix, Bowling Matrix and All-Rounder Matrix.

Batting Matrix:

On the Y-axis we have the form of a player, which is calculated as follows:

Batsman Form = (Runs scored per match in last 10 matches/Career average of runs per match)*100

On X-Axis we have Runs scored per match in last 10 matches and Strike Rate per match in last 10 matches.
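As a quick illustration, the form index can be computed as in the sketch below. The function name and the sample numbers are ours, for illustration only; they are not taken from the dashboard.

```python
def form_index(recent_per_match: float, career_per_match: float) -> float:
    """Return recent output as a percentage of the career average.

    100 means the player is performing exactly at career level;
    values above 100 mean the player is in better-than-usual form.
    """
    if career_per_match <= 0:
        raise ValueError("career average must be positive")
    return recent_per_match / career_per_match * 100


# Hypothetical batsman: 60 runs per match over the last 10 matches,
# against a career average of 40 runs per match.
print(form_index(60, 40))  # 150.0
```

The same helper applies to any per-match statistic measured against its career average.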

England would definitely be a challenging side for opposition bowlers: Eoin Morgan, Jason Roy and Jos Buttler are at the top of their form. Pakistan also has a strong batting line-up, with batsmen such as Babar Azam, Fakhar Zaman and Sarfaraz Ahmed.

Bowling Matrix:

On the Y-axis we have the form of a player, which is calculated as follows:

Bowler Form = (Wickets per match in last 10 matches/Career average of wickets per match)*100

On the X-axis we have wickets taken per match in the last 10 matches and economy per match in the last 10 matches.

India has a top-notch bowling line-up, with Bhuvneshwar Kumar, Mohammed Shami and Yuzvendra Chahal in top form. Also, Mohammed Shami and Jasprit Bumrah have a great economy record. Trent Boult (NZ), Adil Rashid (ENG), Matt Henry (NZ), Mujeeb Ur Rahman (AFG) and Dale Steyn (SA) are some of the bowlers to watch out for.

All-Rounder Matrix:

The first matrix plots wickets vs runs per match over the last 10 matches, while the second matrix plots the all-rounder’s bowling form vs batting form.

Chris Woakes (ENG), Shakib Al Hasan (BAN), Mohammad Nabi (AFG) and Marcus Stoinis (AUS) are amongst the top all-rounders. India’s wobbly middle order can be attributed to the underperformance of Hardik Pandya, Vijay Shankar and Dinesh Karthik. This could be fatal against teams like England and Australia.

We also analyzed overall team performance, using the following two matrices.

Team Runs vs Team Wickets per match over last 10 matches

Team overall batting form vs Team overall bowling form

These matrices reveal some interesting insights. Australia is the team most likely to win the World Cup; however, England is right up there too, with the added home advantage. India, New Zealand, South Africa and Afghanistan are bowling-heavy teams, while Pakistan is a more batting-heavy team. Amongst the tail-enders are Bangladesh, Sri Lanka and West Indies.

The purpose of doing this was to understand the kind of clashes that we can expect between two teams. For example, clashes between the teams in Q3 of the first matrix are likely to be low-target matches.

Since we know that batting attracts more viewership than bowling, we can assume that these matches would have relatively low viewership as opposed to matches between Australia, England or Pakistan. Based on this, we make some corrections to our projections.

Project Flow:

Let’s take a look at the project workflow step by step:

Data Extraction/Scraping:

Cricbuzz was our main source of data. Collecting the data manually was impractical, so we wrote a scraper in Python.

Finding the links for every tournament was a difficult task, as Cricbuzz uses slugs in its URLs instead of integer identifiers, which we could have walked through by simply incrementing the number in the URL.
So we decided to take the help of Google ;)
A Google search request has a URL something like this: https://www.google.co.in/search?q="your search query"
So we made a list of the tournaments whose matches we wanted to find and then used the above query technique to programmatically get the links to the listing pages on Cricbuzz.
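The query technique above can be sketched as follows. This only builds the search URLs; the helper name, the "cricbuzz" hint appended to the query, and the sample tournament names are our assumptions for illustration, and parsing the result page is left out.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.co.in/search"


def build_search_url(tournament: str, site_hint: str = "cricbuzz") -> str:
    """Build a Google search URL for one tournament.

    urlencode takes care of escaping spaces and special characters.
    """
    params = urlencode({"q": f"{tournament} {site_hint}"})
    return f"{GOOGLE_SEARCH}?{params}"


# Illustrative list; the real list held every tournament we needed.
tournaments = ["Asia Cup 2018", "ICC Champions Trophy"]
for name in tournaments:
    print(build_search_url(name))
```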

Once we had found the listing page of matches, the next challenge was to find the links to the details page of each match in the list and also to detect the type of the match. Finding the match type (T20, ODI, Test) has a few catches: because Cricbuzz uses Angular ng tags, it is not easy to scrape with CSS queries, so we located the ng attributes and then used a regex match to find the type of the match.

After we found the type of the match, we wrote 3 different scrapers for the 3 types of matches. All the data was then pushed into CSV/JSON.
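A rough sketch of the idea: pull the values of ng-* attributes out of the raw HTML with one regex, then look for a format keyword with another. The attribute name and markup in the sample are hypothetical; the real Cricbuzz markup differs and changes over time.

```python
import re
from typing import Optional

# Captures the value of any attribute whose name starts with "ng-".
NG_ATTR_RE = re.compile(r'\bng-[\w-]+="([^"]*)"')
# Format keywords we care about; "T20I" is folded into "T20" below.
MATCH_TYPE_RE = re.compile(r"\b(T20I?|ODI|Test)\b", re.IGNORECASE)


def detect_match_type(html: str) -> Optional[str]:
    """Return 'T20', 'ODI' or 'Test' if an ng attribute mentions one."""
    for value in NG_ATTR_RE.findall(html):
        found = MATCH_TYPE_RE.search(value)
        if found:
            token = found.group(1).upper()
            if token.startswith("T20"):
                return "T20"
            return "ODI" if token == "ODI" else "Test"
    return None


# Hypothetical snippet, not actual Cricbuzz markup.
sample = '<div ng-show="match.format == \'ODI\'">India vs Pakistan</div>'
print(detect_match_type(sample))  # ODI
```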

Once we have the data in CSV/JSON we can apply our model

  • You need to be aware of IP blacklisting while scraping. Be gentle to the website by not hitting it with thousands of requests in a very short span of time. Instead, use time.sleep between requests so that the website does not mistake your scraper for a DDoS attack and ban you.
  • Using Selenium with ChromeDriver tricks sites into believing that the user is an actual person instead of a bot.
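A minimal sketch of the throttling tip, assuming a fixed pause between requests. The helper name and the 2-second default are our choices; the fetch function is passed in so that any HTTP client (or a Selenium driver) can be plugged in.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # be gentle: wait before the next hit
        pages[url] = fetch(url)
    return pages


# Demo with a stand-in fetcher so the sketch runs without network access.
pages = fetch_politely(["page-1", "page-2"], fetch=str.upper, delay_seconds=0.1)
print(pages)
```

In a real run, fetch would wrap whatever client performs the request, and a slightly randomized delay is a common refinement.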

Past viewership data can be obtained through licensed software: BARC India Media Workstation (BMW).

Google Trends for Digital Buzz:

For this we extracted 5 years of Google Trends data for all the teams. Google Trends gives us relative data for each keyword, with 100 representing the maximum search interest in that time span. Also, it allows us to compare and extract data for only 5 keywords at a time. Therefore, to extract data for all the teams we had to take multiple extractions, keeping the first 4 search terms constant and changing only the last team name. With date as the key field, we combined this data with our original viewership data sheet.

When an extraction spans 5 years, we don’t get day-by-day data but data aggregated by week. Even then, this data tracks the viewership trends closely, as we will see later.
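The batching can be sketched as below. Because values are relative within a single extraction, keeping the same 4 terms in every batch gives all extractions a common reference. Which 4 teams were held constant is not stated here, so the choice of anchors in the example is an assumption.

```python
from typing import List, Sequence


def build_trend_batches(
    anchors: Sequence[str], others: Sequence[str], max_terms: int = 5
) -> List[List[str]]:
    """Return one keyword batch per remaining team: the anchors plus that team."""
    if len(anchors) != max_terms - 1:
        raise ValueError("need exactly max_terms - 1 anchor terms")
    return [list(anchors) + [term] for term in others]


# Assumed anchors; the remaining World Cup 2019 teams rotate through slot 5.
anchors = ["India", "Pakistan", "England", "Australia"]
others = ["New Zealand", "South Africa", "Bangladesh",
          "Sri Lanka", "West Indies", "Afghanistan"]
for batch in build_trend_batches(anchors, others):
    print(batch)
```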
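Since the buzz figures are weekly while matches are dated by day, each match date has to be mapped to the week it falls in before the two sheets can be joined. A minimal sketch, assuming the weekly rows are keyed by the Sunday that starts the week (check this against your own export); the buzz value in the example is made up.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, Optional


def week_start(day: date, first_weekday: int = 6) -> date:
    """Return the start of the week containing `day` (6 = Sunday)."""
    offset = (day.weekday() - first_weekday) % 7
    return day - timedelta(days=offset)


def attach_weekly_buzz(
    match_dates: Iterable[date], weekly_buzz: Dict[date, int]
) -> Dict[date, Optional[int]]:
    """Look up the weekly buzz value for every match date."""
    return {day: weekly_buzz.get(week_start(day)) for day in match_dates}


weekly_buzz = {date(2019, 6, 16): 100}  # illustrative value only
print(attach_weekly_buzz([date(2019, 6, 18)], weekly_buzz))
```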

Data Pre-processing:

Before putting everything into the model, a lot of data pre-processing had to be done. We used a combination of Pandas and MS Excel, as some things were simply easier to do in Excel and others in Pandas.

Firstly, we created a joint Excel sheet, Cricket_final.xlsx, comprising data from all three sources, viz. Cricbuzz, BARC and Google Trends. This was followed by adding some features using either Excel or Pandas.

Microsoft Excel

Match No. of the Series:

The entire sheet was sorted on the following columns:

  • Series: the name of the tournament
  • Type: the match format
  • Description: the name of the series as given in the raw BARC data (please refer to our previous blog for more details)

Using a simple nested IF, we then numbered the matches.
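For readers who prefer code to spreadsheet formulas, the numbering logic looks roughly like this in Python. It assumes the rows are already sorted and that the counter restarts whenever the (Series, Type) pair changes; the exact columns compared in our Excel formula may differ.

```python
from typing import Dict, List


def number_matches(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Add a running 'Match No' that restarts for each (Series, Type)."""
    numbered = []
    previous_key = None
    match_no = 0
    for row in rows:
        key = (row["Series"], row["Type"])
        # Same series as the previous row: increment; otherwise restart at 1.
        match_no = match_no + 1 if key == previous_key else 1
        numbered.append({**row, "Match No": match_no})
        previous_key = key
    return numbered


rows = [
    {"Series": "Asia Cup 2018", "Type": "ODI"},
    {"Series": "Asia Cup 2018", "Type": "ODI"},
    {"Series": "ICC Champions Trophy", "Type": "ODI"},
]
print([r["Match No"] for r in number_matches(rows)])  # [1, 2, 1]
```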


The following tournaments were rectified manually:

  • Asia Cup 2016
  • Asia Cup 2018
  • ICC T20 WC 2016 Qualifiers
  • ICC T20 WC 2016
  • ICC ODI WC 2018 Qualifiers
  • ICC Champions Trophy
  • Pakistan Tour of England Match 1 (No Result)


The rest of the data massaging was done using Pandas. We divided the data into three sets (ODI, T20 and Test) based on match format, and trained a model on each dataset separately. We observed that this small step significantly boosted our prediction accuracy.
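The partitioning itself is simple. We did it in Pandas, but a dependency-free sketch of the same idea looks like this; the "Type" field name follows the sheet described above, and the sample rows and TVR values are made up.

```python
from collections import defaultdict
from typing import Dict, List


def split_by_format(rows: List[dict], format_field: str = "Type") -> Dict[str, List[dict]]:
    """Group rows by match format so each format gets its own model."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[format_field]].append(row)
    return dict(buckets)


rows = [
    {"Type": "ODI", "TVR": 3.1},
    {"Type": "T20", "TVR": 4.0},
    {"Type": "ODI", "TVR": 2.7},
]
subsets = split_by_format(rows)
print({fmt: len(items) for fmt, items in subsets.items()})  # {'ODI': 2, 'T20': 1}
```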

Predictive Modeling:

Each sub-dataset was split into train, test and validation sets. Since the subsets had relatively small sample sizes, we did not opt for cross-validation.

Since we had a lot of categorical variables, we used CatBoost for projections. We took into account the following features:
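A three-way split can be sketched as below. The split proportions and the random seed are placeholders; the actual proportions we used are not stated in this post.

```python
import random
from typing import List, Sequence, Tuple


def three_way_split(
    rows: Sequence, test_frac: float = 0.2, val_frac: float = 0.2, seed: int = 42
) -> Tuple[List, List, List]:
    """Shuffle rows reproducibly and return (train, test, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, test, validation


train, test, validation = three_way_split(range(20))
print(len(train), len(test), len(validation))  # 12 4 4
```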

  • Channel
  • Day of the week
  • Start Time of the match
  • Match Format
  • Playing Teams
  • Venue
  • Year
  • Match No. of that series
  • Indian/Non-India Matches
  • No. of Teams participating
  • Type of the series (Bi-lateral, Tri-lateral, Asia, World, Qualifier)
  • Match Type (Opening, Regular, Semi-final, Final)
  • Total Buzz around the teams taken from Google Trends
  • Viewership data
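A hedged sketch of how such features can be fed to CatBoost, which needs to be told which columns are categorical. The hyperparameters below are placeholders rather than the values we tuned, and the helper names are illustrative. The catboost package is third-party and is imported only inside the training function.

```python
from typing import List, Sequence


def categorical_indices(
    feature_names: Sequence[str], categorical_names: Sequence[str]
) -> List[int]:
    """Return the column positions of the categorical features."""
    wanted = set(categorical_names)
    return [i for i, name in enumerate(feature_names) if name in wanted]


def train_viewership_model(X, y, feature_names, categorical_names):
    """Fit a CatBoost regressor on rows X (list of lists) and targets y."""
    from catboost import CatBoostRegressor  # pip install catboost

    # Placeholder hyperparameters, for illustration only.
    model = CatBoostRegressor(iterations=500, learning_rate=0.05, verbose=False)
    model.fit(X, y, cat_features=categorical_indices(feature_names, categorical_names))
    return model


features = ["Channel", "Day of the week", "Year", "Venue"]
print(categorical_indices(features, ["Channel", "Day of the week", "Venue"]))  # [0, 1, 3]
```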

CatBoost gave us very good accuracy on all three sub-datasets. We opted for the adjusted R2 score as our scoring metric. Adjusted R2 is more reliable than R2 because it penalizes the model for any extra features that do not improve the fit, whereas plain R2 never decreases when a feature is added.
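For reference, adjusted R2 is computed from R2, the number of samples n and the number of features p as 1 - (1 - R2)(n - 1)/(n - p - 1). A small sketch, with made-up numbers in the example:

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Same R2, more features: the adjusted score drops.
print(adjusted_r2(0.90, n_samples=100, n_features=5))
print(adjusted_r2(0.90, n_samples=100, n_features=14))
```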

T20: Training Accuracy – 90%; Test Accuracy – 90%; Validation Accuracy – 90%

ODI: Training Accuracy – 90.8%; Test Accuracy – 90.2%; Validation Accuracy – 82.9%

Test: Training Accuracy – 90%; Test Accuracy – 90%; Validation Accuracy – 90%

These seemed to be good accuracy numbers for a start. Over and above this, we observed that a match between two bowling-heavy teams was likely to have a low run rate, hence a low target, and hence be less popular. Matches between batting-heavy teams, however, generally garnered higher viewership. Keeping this in mind, the final predictions were adjusted by a factor of 10%.
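One way to express such a correction in code is sketched below. The team groupings follow the matrices discussed earlier, but the exact rule is an assumption made for illustration: 10% down when both teams are bowling-heavy, 10% up when both are batting-heavy, no change otherwise.

```python
# Groupings taken from the team matrices above; treating Australia and
# England as batting-heavy alongside Pakistan is an assumption.
BOWLING_HEAVY = {"India", "New Zealand", "South Africa", "Afghanistan"}
BATTING_HEAVY = {"Australia", "England", "Pakistan"}


def adjust_prediction(predicted_tvr: float, team_a: str, team_b: str,
                      factor: float = 0.10) -> float:
    """Nudge a model prediction based on the expected match dynamics."""
    teams = {team_a, team_b}
    if teams <= BOWLING_HEAVY:   # both sides bowling-heavy
        return predicted_tvr * (1 - factor)
    if teams <= BATTING_HEAVY:   # both sides batting-heavy
        return predicted_tvr * (1 + factor)
    return predicted_tvr


print(adjust_prediction(10.0, "India", "England"))  # 10.0 (mixed clash, unchanged)
```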

Finally, we used the model to predict TVRs for World Cup 2019.

Data Visualization:

Power BI was used as the dashboard platform. The dashboard explains various features and their relationships with viewership across all three formats. We already looked into some of these earlier.

Below is the snapshot of the Data model for Dashboard. Tables in the top row are loaded key tables, the ones in the middle are Fact tables and the ones in the last row are calculated key tables.

Tables “Batting_team” and “Fielding_team” are calculated fact tables, used for calculating team-level performance KPIs. “match_table” is used to calculate the likely winner, which per se does not have a predictable impact on viewership but was given as an add-on.

All the data and respective predictions are as of 12.06.2019.

Please drop us a mail on hello@venanalytics.io to get a copy of the updated report along with the dashboard link.