Females in media

A data-based exploration of female representation in media. Let's dive in.

"Improving diversity and pluralism in the media also means providing greater opportunities for people."

- UNESCO, 2019



With gender equality being raised in a wide variety of domains, a data-backed analysis can help our society understand its current position and the steps it needs to take to improve. This project aims at analysing the status of females through the lens of media.

Using the Quotebank dataset, a corpus of English quotations from a decade of news, the project provides insights into how gender is represented. The data in this project covers quotes published between 2015 and 2020. The names of the sites at the origin of the quotes were extracted from the article URLs, which were provided in the original dataset. Based on this list, which uses Google PageRank and other independent web metrics for various search engines (more about the ranking method here), 116 sites were selected based on their web ranking scores. Only quotes whose source was within this list were kept for the study. This filtering reduced the media sources to the best-known and most widely read journals and sites.
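The filtering step can be sketched as follows. This is a minimal illustration, not the project's actual code: the shortlist `SELECTED_SITES`, the suffix table `TWO_PART_SUFFIXES` and the rule "keep a quote if at least one of its URLs belongs to a selected site" are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical shortlist; the project keeps 116 sites chosen by web ranking.
SELECTED_SITES = {"bbc", "reuters", "nytimes"}

# Two-part public suffixes handled by this sketch (illustrative, not exhaustive).
TWO_PART_SUFFIXES = {"co.uk", "com.au", "co.za"}


def site_name(url: str) -> str:
    """Return the site name of a URL's host, e.g. 'bbc' for www.bbc.co.uk."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_PART_SUFFIXES:
        return labels[-3]
    if len(labels) >= 2:
        return labels[-2]
    return host


def keep_quote(urls: list[str]) -> bool:
    """Keep a quote if at least one of its article URLs is from a selected site."""
    return any(site_name(u) in SELECTED_SITES for u in urls)
```

For instance, `site_name("https://www.bbc.co.uk/news")` yields `"bbc"`, so a quote carrying that URL would be kept under the hypothetical shortlist above.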

  • 6 years of data (2015 to 2020)
  • 41.5 million quotes in total
  • 21 media countries represented
  • 116 media kept
  • 7 categories analysed
Annual percentage of quotes

With this data, an in-depth analysis is performed to understand:

  • How does female representation in media vary in different countries?
  • How does female representation evolve in time?
  • Do different types of media sources represent females equally?
  • What topics are females quoted in?

Is gender equality in the media a universal concept?

How are females quoted around the world compared to males?

The average percentage of female quotes for each media's country of origin from 2015 to 2020 provides a preliminary glimpse into the different countries and the female representation in their respective media. At first glance, media from France, Israel and Latvia have a low female representation compared to Germany, the United Arab Emirates or Brazil. However, this observation is to be taken with caution, as non-English-speaking countries have only a handful of media, which focus on specific categories. This induces a bias in the gender representation for those countries. For example, Eurosport is one of France's only 2 media sources, which could explain the low percentage of females, as sports coverage focuses more on males.

English-speaking countries, on the other hand, have a greater number of media and hence cover a wider range of categories. Interestingly, the United Kingdom has the largest proportion of female quotes of all countries, at 23.7%. South Africa appears to quote proportionally more females than the United States. As a result, a more in-depth investigation is conducted, focusing on these three countries, in order to determine the causes of these patterns.

There is no country that quotes more females than males



Comparing female representation in 3 countries: UK, USA and South Africa

Note that the UK, USA and South Africa account for 12.7%, 86.7% and 0.7% of the quotes, respectively. For that reason, relative numbers per country are used.
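To illustrate why relative numbers matter when the countries differ so much in size, here is a minimal sketch that computes the share of female quotes per country. The records and their counts are made up for the example, and the tuple layout is an assumption rather than the dataset's actual schema.

```python
from collections import defaultdict

# Hypothetical records: (media country, speaker gender, number of quotes).
records = [
    ("UK", "female", 30), ("UK", "male", 90),
    ("USA", "female", 20), ("USA", "male", 80),
]


def female_share_by_country(rows):
    """Return {country: percentage of quotes attributed to females}."""
    female = defaultdict(int)
    total = defaultdict(int)
    for country, gender, count in rows:
        total[country] += count
        if gender == "female":
            female[country] += count
    return {country: 100 * female[country] / total[country] for country in total}


shares = female_share_by_country(records)
```

Because each percentage is normalised by the country's own total, a country with few quotes can be compared with a much larger one.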

How does the percentage of females quoted evolve across the years?

The percentage of female quotes gradually increases for the UK, peaking at 26.8% in 2019, while it remains relatively steady in the other two countries. Surprisingly, an abrupt decrease appears in 2020. This could be due in part to the reduced number of quotes in 2020 in the dataset.

The changes over time confirm the first perceptions about the countries: the United Kingdom, which has the greatest proportion of female quotes overall, also leads throughout the years. South Africa has slightly higher proportions than the United States for all years but 2016. This year marked the presidential election with the famous rivalry between Donald Trump and Hillary Clinton which was omnipresent in the media. Indeed, it may be interesting to focus on the distinct speakers and their individual impact on gender representation in the media. An average for each year does not account for the diversity of quoted females, and one unique female may be very often quoted, skewing the prior analysis.

Hover over the nodes for more information on the age and exact number of quotes

Who are the most quoted people?

The speakers are now ranked according to the number of times they were quoted. The interactive network shows the top three speakers for each country and gender. The size of a node is proportional to the number of quotes from the speaker in the country in which he or she was cited, and the colors represent gender.

For 100 female quotes, 10 originate from distinct speakers in South Africa, but only 5 and 2 for the United Kingdom and the United States of America, respectively
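The figure in the box above can be read as a number of distinct speakers per 100 quotes. A minimal sketch of that computation, with made-up speaker names and counts:

```python
def distinct_speakers_per_100_quotes(speaker_counts):
    """speaker_counts maps each speaker to their number of quotes."""
    total_quotes = sum(speaker_counts.values())
    if total_quotes == 0:
        return 0.0
    return 100 * len(speaker_counts) / total_quotes


# Hypothetical example: 3 distinct speakers sharing 50 quotes.
example = {"speaker_a": 30, "speaker_b": 15, "speaker_c": 5}
per_100 = distinct_speakers_per_100_quotes(example)
```

A low value means that a few speakers account for most of the quotes.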

Here again, the unequal gender representation is unambiguous, since within each country, all the male nodes are bigger than the female nodes. Donald Trump, as former president of the United States, was present on an international scene and in many different topics. He consequently appears in the top three of all countries. The other main speakers in the USA are all politicians, with an exception for Pope Francis. The most quoted females in the USA are Hillary Clinton and Nancy Pelosi, who are both females above 70 years old. However, in the UK the most quoted female speakers tend to be celebrities or entertainment figures, such as Meghan Markle and Katie Price, who are younger females. On the other hand, all the males in the UK work in politics. In South Africa, the most prominent speakers are once again all political figures. There, the three most quoted females fall in the age range of 50-70 years old. With these results in mind, one can see how the categories as well as the age seem to determine the proportion of females within each country.

Does age determine who will be quoted?

The age distribution for both genders is displayed above, by country. The distinction between countries is obvious, and the distinction between genders even more so. The age distribution of the males is as one would expect, with most speakers belonging to the active workforce. Females, on the other hand, appear to have a different age distribution depending on the country. In South Africa, 37.8% of the females mentioned are between the ages of 60 and 70, which is consistent with the top three distinct female speakers from the prior network. In the United Kingdom, the distribution is significantly skewed towards younger females. This is also supported by the network above, with two young females appearing in the top three for the UK, unlike in the other two countries. The top three alone, however, cannot explain these observations in the age distribution. Indeed, since the proportion of distinct female speakers (see the box above) is much lower for the USA than for South Africa, one could expect Hillary Clinton and Nancy Pelosi to have a big influence on the age distribution of American females. This is, however, not the case. A possible explanation for this contradiction is that, although the USA quotes few distinct females relative to its total number of quotes, these females are well distributed over the age intervals. Indeed, the lowest and highest quarters of the age distribution of American females span 0-34 years and 62-100 years respectively, with a median of 47 years.
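Quartiles such as those quoted above can be obtained with Python's standard library. The ages below are made up and were chosen only so that the result matches the quartile values mentioned in the text. They are not the project's data, and the "inclusive" interpolation method is an assumption.

```python
import statistics

# Hypothetical speaker ages (one entry per distinct speaker).
ages = [22, 30, 34, 41, 47, 53, 62, 70, 85]

# quantiles(..., n=4) returns the three cut points separating the four quarters.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
```

Here a quarter of the speakers are at most `q1` years old and a quarter are at least `q3` years old.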

What are the media sources behind the quotes?

The previous analysis shows that the representation of females is influenced by the country of origin of the media. The table below shows the different media that quoted the top speakers in each of the analyzed countries. Looking at the various media leads us to dig deeper with the next part investigating the media sources classified by how reliable and how popular they are.



How does respected media compare to popular media?



Which media are considered respected or popular?

Different media sources are likely to have different tendencies in the content they create, the news they cover and consequently the quotes they use. Readers also judge the quality of news and articles according to the media source. In order to explore whether preconceptions about media sources hold in practice, this section compares the representation of females in respected and popular media. The respected media are known to share news stories backed by facts and deemed reliable, while the popular media are the most visited media pages, which are not necessarily "respected". Only quotes originating from a shortlist of media sources are kept for this analysis, as presented in the table. These media sources were selected according to the following sources: respected, and popular.

And how do they evolve over the years?

At first glance, how do these two groups of media differ in female representation through quotes? While one would hope that respected media would perform better in terms of equal gender representation, this figure shows that it is not the case. In fact, a gap is present between popular and respected media every single year.

The increase in the proportion of females over the years in both groups is encouraging, with the drop in 2020 likely due to the reduced number of quotes collected for that year. The increasing trend likely reflects the growing awareness of gender issues, which have increasingly been voiced and covered in the media.

No media source is even half-way close to gender equality.




How does the trend break down within a group of media?



Within the respected or the popular media, not all the sources present the same proportion of female quotes, which ranges from 12% to 25%. Reuters, a source considered to be respected, presents the lowest proportion of female quotes. Other respected media such as Washington Post, BBC and Politico behave similarly, starting at a low proportion of females in 2015 and gradually increasing throughout the years. Within the popular media, sources such as CNN have higher proportions early on (18.5% in 2015) and continue to increase. It is also the popular media that present the quickest increase: msn goes from 14.6% in 2015 to 25.2% in 2018. It is interesting to note that the proportion differs more within each group of media than between the two groups.

But what do females talk about?

When thinking of the possible differences between "respected" and "popular" media, the differences in categories covered come to mind. Indeed, the type of news presented by the sources differ thematically, politically and in terms of audience. The overall proportion of females does not give insight into what they are quoted for and how this differs between the respected and popular media.

By analysing keywords present in the URL of the article or page containing the quote, the quotes themselves are sorted into different categories of news. Seven main categories were predefined, and indicator words were used to classify the quotes into these categories. Only approximately 50% of the quotes were attributed to a category, and the quotes not belonging to any of the predefined topics were ignored for this analysis. Analysing categories of quotes makes it possible to see whether there are trends between genders based on topics, and helps explain the results observed for the proportion of females depending on the media source.
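A keyword-based classification of this kind could look like the sketch below. The indicator words and the three categories shown are hypothetical, since the project's actual keyword lists are not given here. The rule "first matching category wins" is also an assumption.

```python
import re

# Hypothetical indicator words for a few of the predefined categories.
CATEGORY_KEYWORDS = {
    "sports": {"sport", "sports", "football"},
    "politics": {"politics", "election"},
    "entertainment": {"entertainment", "celebrity"},
}


def categorize(url):
    """Return the first category with an indicator word in the URL, else None."""
    # Split the lowercased URL on anything that is not a letter.
    tokens = set(re.split(r"[^a-z]+", url.lower()))
    for category, keywords in CATEGORY_KEYWORDS.items():
        if tokens & keywords:
            return category
    return None
```

Quotes for which `categorize` returns `None` correspond to the uncategorised quotes that are left out of the analysis.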

In 2018, over 50% of quotes in entertainment were by females, for popular media.

Rather unsurprisingly, the proportion of female quotes is dependent on the category of news, regardless of the type of media. The distribution of categories among the genders appears to follow stereotypes: very few females are quoted in sports, while entertainment and culture present higher shares of female quotes. Media sources such as msn and Yahoo are expected to post more about entertainment and celebrity news than Politico, for example. These are categories where females typically appear more, thus explaining the higher proportion of female quotes achieved by the popular media. This is particularly visible in 2018, when more than 50% of the quotes in the entertainment category originated from females. This maximum is the only occurrence of a majority of female quotes.


"If my dad doesn't walk her down the aisle, then I will."

- Meghan Markle, Duchess of Sussex
(1077 repetitions)

"Beyoncé, I was hurt because I heard that you said you wouldn't perform unless you won Video of the Year over me and over Hotline Bling"

- Kanye West, Rapper and record producer
(475 repetitions)


"... isn't whether machines can think, but whether human beings can still feel."

- Manohla Dargis, American film critic
(165 repetitions)

"Trump took credit for no one dying in a plane crash this year! That explains his new campaign slogan, 'Trump 2020: You got to Tulsa, didn't ya?'"

- Stephen Colbert, Television host
(331 repetitions)


Some speakers' quotes recur in several media sources, such as Meghan Markle's most famous quote with 1077 repetitions. If certain sites produce several articles on the same story and use the same quote, this will increase the overall proportion of female quotes. The numbers analysed until now therefore do not reflect how many distinct females are quoted compared to males. Will the numbers change if the proportion of distinct females to distinct males is considered instead?

Only 23% of distinct speakers are females for both popular and respected media.

Is the distribution of distinct speakers different for each gender?

The number of distinct speakers is used to determine the diversity of speakers represented within each gender. The graph to the right depicts the number of males quoted for every female. This ratio is obtained by dividing the number of distinct male speakers by the number of distinct female speakers in both respected and popular media. It is interesting to note that for every female, at least two males are quoted; this ratio rises for topics like politics, science, business, and sports. The ratio varies more between categories than between popular and respected media, indicating that the topic has more influence on how females are quoted.
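The ratio of distinct male speakers to distinct female speakers can be sketched per category as follows. The tuple layout `(category, speaker, gender)` and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict


def male_to_female_ratio(quotes):
    """quotes: iterable of (category, speaker, gender). Returns {category: ratio}."""
    speakers = defaultdict(lambda: {"male": set(), "female": set()})
    for category, speaker, gender in quotes:
        if gender in ("male", "female"):
            # Sets count each speaker once, however often they are quoted.
            speakers[category][gender].add(speaker)
    return {
        category: len(groups["male"]) / len(groups["female"])
        for category, groups in speakers.items()
        if groups["female"]  # skip categories without any female speaker
    }


# Hypothetical sample: speakers A and D are each quoted twice.
sample = [
    ("politics", "A", "male"), ("politics", "B", "male"), ("politics", "A", "male"),
    ("politics", "C", "male"), ("politics", "D", "female"), ("politics", "D", "female"),
    ("sports", "E", "male"),
]
ratios = male_to_female_ratio(sample)
```

Repeated quotes from the same person do not change the ratio, which is exactly what distinguishes this metric from the proportion of quotes used earlier.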

This observation also brings a new perspective to the results obtained previously with regard to the proportion of females in popular entertainment. Whereas females appeared to be quoted more in entertainment by popular media, there are still twice as many distinct males as distinct females. Hence, the same women are repeatedly quoted, which increases the overall proportion of female quotes. In fact, females do not truly make up over 50% of the speakers in this category. This reduced number of distinct females could indicate that there are more barriers for females in media, making it harder for them to be quoted compared to males.

The diversity of male speakers always remains greater than that of females. The same females tend to be repeated more often.

Fewer distinct females are present in the dataset, but when they are, are they given the same "space"? It can be interesting not only to look at how the proportion of quotes by females varies across categories and media types, but also to analyse the time they are allocated. A simple metric for this is the quote length. Taking the categories that tended to quote females and males more equally, i.e. entertainment and culture, the quote length can be compared. Are there any differences in the quote length regarding these categories? And what about the quote length depending on the type of media?

How is time allocated between the genders?

The plot to the right helps determine whether males get more time to express themselves than females. Females tend to have longer quotes than males in the popular media, although the pattern is the opposite in the respected media covering culture. Although the focus was placed on these two categories, the trend of females holding shorter quotes than males holds true across all years and categories.
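As a rough illustration of the quote-length metric, the sketch below averages the length of quotes per gender. Measuring length in whitespace-separated words is an assumption (characters would work equally well), and the sample quotes are invented.

```python
from statistics import mean


def mean_quote_length(quotes):
    """quotes: iterable of (gender, text). Returns {gender: mean length in words}."""
    lengths = {}
    for gender, text in quotes:
        lengths.setdefault(gender, []).append(len(text.split()))
    return {gender: mean(values) for gender, values in lengths.items()}


# Hypothetical quotes.
sample = [
    ("female", "We will keep working on this"),   # 6 words
    ("female", "No comment"),                      # 2 words
    ("male", "The results speak for themselves"),  # 5 words
]
average_lengths = mean_quote_length(sample)
```

The same function can be applied to any subset of quotes, for example a single category or media type.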

How do the media sources differ in female representation within culture and entertainment?

Entertainment and culture were the two categories where the greatest proportion of females was observed. Do sources within respected and popular media have any differences in these precise categories? Apparently, yes. Even within respected or popular media, the proportions can vary notably. The overall higher proportion within popular media observed earlier is visible here again. For culture, the proportion of females in respected media even has a tendency to decrease with time, such as for Reuters or Politico.

Within entertainment, there is more variation from one year to another, making a specific trend less evident. As an illustration, the BBC, present in both respected and popular media, jumps from 15.8% to 35.7% in one year before dropping again to 13.9% in 2018. On average, the respected media do seem to have lower proportions of females than the popular media. For example, msn has above 50% of female quotes in certain years, while the threshold of 40% is only exceeded in one case for the respected media: 46.3% for New York Times in 2019.

One media source in particular stands out: Reuters. While other media sources increase and decrease, it is the only source that has a net decline in female proportion for both entertainment and culture. However, looking at the proportion of female quotes in Reuters globally and not within a category, the numbers only increase with time (see plot here). Could this mean that more females in Reuters are being quoted in topics other than entertainment and culture?

Why does the proportion of females in Reuters decrease in entertainment and culture?

If the proportion is decreasing in entertainment and culture, but the whole average is increasing, then females must be quoted more in other domains for Reuters particularly. The previous analysis was performed by providing a list of topics and sorting the quotes accordingly. This could lead to a bias, as not all possible categories are considered. Furthermore, the category was inferred from the URL of the source of the quote, supposing that main keywords or the type of news is included in the URL itself. Again, this might not always be representative of the quote itself.

Using natural language processing, quotes are tokenized and lemmatized, entities are detected, stopwords are removed and bigrams are included. Latent Dirichlet Allocation (LDA) is applied to a subset of this data. Reuters' quotes are analysed for the years 2015 and 2017, for both genders, and words from the quotes are clustered into 5 topics through unsupervised learning.
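Part of this preprocessing can be sketched with the standard library alone. The example below covers only tokenization, stopword removal and bigram construction. Lemmatization, entity detection and the topic model itself require an NLP library and are left out. The stopword list is a tiny illustrative one, and forming a bigram from every pair of adjacent remaining tokens is a simplifying assumption.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "with", "we"}


def preprocess(quote):
    """Lowercase, tokenize, drop stopwords, then append adjacent-word bigrams."""
    tokens = [t for t in re.findall(r"[a-z]+", quote.lower()) if t not in STOPWORDS]
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


terms = preprocess("We reached a deal with North Korea")
```

Bigrams such as `north_korea` let multi-word names survive as single terms, which is why a term like "North Korea" can appear in a topic.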

Clusters of topics are grouped into main themes, as shown in the table above. This makes it possible to observe the change in the themes most mentioned by males and females in 2015 and 2017. In 2015, females were not very present in topics like geopolitics. Many terms fall into a vaguer theme, such as "feel", "think" and "woman". Two years later, fewer of these terms are present for females, giving space to new and more pertinent terms such as "law", "North Korea" and "deal". This could help explain the trend of decreasing proportion in entertainment and culture in the considered media. For males, by contrast, the distribution changes less, with most of their quotes falling within politics, economics and geopolitics. In fact, in 2015 terms such as "attack", "force" and "Iran" appear for males as important terms. In 2017, similar terms alluding to the same themes can still be found, among them "policy", "security" and "China".

WHAT DOES THIS DATA STORY TELL ABOUT GENDER EQUALITY?

Throughout this project the reader was guided through the unequal distribution of gender representation in the media. This is not only a sad reality but also one with heavy consequences for aspiring future generations. Despite encouraging improvements over the years, gender equality in the media is still far from being reached.

  • Younger females are more quoted than older ones in the UK
  • Females are mostly quoted in culture and entertainment
  • They are the least quoted in topics such as sports, business and science
  • At least two distinct males are quoted for every female

The team

Here is the team behind the project:

Lisa Laurent
EPFL-Energy
Sélène Ledain
EPFL-SIE
Lavinia Schlyter
EPFL-CSE
Arthur Waeber
EPFL-Energy
