cs_team_beta_hw2: Open Food Facts

#Data Analysis

Here we explore questions we thought could prove worthwhile. In order, we have:

What is the most sugary product, and what is its nutrition score?
Is there a relation between the amount of fat, sugar and salt in products?
Does the amount of sugar in a product depend on the brand type (low-cost, high-cost, biological etc)
How many products contain more sugar than recommended, and how does this depend on category?

Analysis: The First Investigation

Proportion of Sugars per Products, generated with Seaborn from jupyter Notebook.

For the first investigation, we would like to check if we can already see some tendances and have a look on the most sweet Products. As we can see, there is not so much surprise here, Desserts and beverages are shown on the top.

We did some spot checks, in order to have some idea about some sweety prodcuts.
Incredible !!! After a deep analysis, we discovered that the product which is containing the highest quantity of sugar, has a nutriction score of D, which is quite a bad score.

As written on the nutrition's facts, there is a high quantity of sugar.
By the way, this product is nothing else than a 1kg of Sugar. At this stage, we all love data analysis !

What could we say about this ice cream ?
Which is containing about 28gr of sugar for a portion of 100g. You are almost eating 28% of sugars !!

This Product M-Quick is quite amaizing too, a drink to cacao, which contains 80% of sugars and sold as cacao !!!

We are starting to hate Data Analysis, why our Toblerone is containing so much sugar ? If the numbers are correct, it is composed with almost 60% of sugars !

Have a look below, on Word Cloud, the Products containing the most sugars !

Proportion of Sugars per Products, generated with Word Cloud from jupyter Notebook.

Sugar, fat and salt

Sugar alone is already quite bad, but it's not the only foodstuff that has been cited in relation to health problems. Fat and salts too are easy to overconsume, and to that end we ask the question - do bad things come in threes?

Are sugary products also fatty?

Are fatty products also salty?

Are salty products also sugary?

Let's dive into it!

Scatterplot of sugar, salt and fat in products

Scatterplot of sugar, salt and fat in products, generated from jupyter Notebook.

What we see above is a set of scatterplots based on all the products that contained valid (in the range 0-100g) values for sugar, fat and salt content - roughly 99.9 percent of the original data. What this means in practice is that we have 560974 products plotted. Even with ensuring acceptable ranges however, at least two problems become immediatly apparent when observing the plots.

First, the data we are using is essentially how many percent of a product is made of sugar, fat and salt. The total of these values should never add up to more than a hundred, so we would expect all of the points on the scatterplots to remain firmly underneath a diagonal line drawn from the upper left to lower right corner.

Clearly some products have amazing properties - Markets, Soft Cherry Candy Balls for example consist of 100% sugar and 100% fat, Chewing-gum espagnol consist is 100% sugar, 100% salt.

Incredible. Absolutely incredible.

We would still like to assume that the majority of data is correctly labeled, so in order to get a more true picture of how these three values are distributed we decided to plot the density of the points rather than all the individual points.

Density plot of sugar, salt and fat in products

Density plot of sugar, salt and fat in products, generated from jupyter Notebook.

That paints, quite literally, a different picture. It appears that in general the values only range within tight boundaries. It is difficult to interpret the pictures however so we make an effort to reduce the effect of outliers by removing products with values beyond two standard deviations of the mean, resulting in dropping 12 % of the products.

Density plot of sugar, salt and fat in products without outliers

Density plot of sugar, salt and fat in products without outliers, generated from jupyter Notebook.

Nice. What we see now is how sugar, fat and salt correlates in the products. Visually it seems that most products contain around zero of all three quantities, Maybe that more sugar tend to imply less salt but honestly, it's still difficult to tell. To amend that we'll start of with trying to use the Pearson coefficent to see how well the relationships in the plots can be described by a simple line.

Pearson Coefficient matrix

Pearson Coefficient matrix, generated from jupyter Notebook.

In this first attempt at measuring the correlation we used the data for all products, and the results point to there being absolutely no relationship (indicated by values close to zero).

However, the Pearson coefficent takes outliers too strongly into account, so much like we did when we plotted the density we remove the ten percent of the data that tended to extremes.

Pearson Coefficient matrix without outliers

Pearson Coefficient Matrix without outliers, generated from jupyter Notebook.

In this second matrix some trends do appear, though they are very weak. Sugar and salt is weakly negatively correlated, with an increase in one coming at the decrease of another. Meanwhile sugar and fat, fat and salt exhibit a weak positive relationship.

Pearson is limited in that it only measures linear relationships, so for the sake of completedness we also decided to have a look at the Spearman coefficient to see if there is a rank-correlation.

Spearman Coefficient without outliers

Spearman Coefficient without outliers, generated from jupyter Notebook.

Using the Spearman coefficent revealed some more trends in the data. The negative relationship between sugar and salt came out stronger while the the weak positive relationship between sugar and fat disappeared, and the positive relationship between salt and fat appeared stronger.

Sugar and Fat, colored by salt

As a final effort to visualize how all thre variables interact we decided to create a scatterplot with the hue of the datapoints corresponding to the salt content. It is difficult to extract any meaningful relationship from this however, which falls in line with what our analysis so far seem to point at. The most we can see is that the majority of products are rather low on salt, with a few products strongly sticking out.

CONCLUSIONS

Before making any statements, let us reiterate - the data is quite suspect. While we may have been able to remove the most obviously wrong data we are most likely far from catching everything.

But let's assume that:

the range of products we have available are representative of the true distribution of products
the majority of the data is indeed correctly labeled

Then, we can see some weak relationships.

Sugary products have a slight tendency to contain less salt
Sugary products have a slight tendency to contain more fat
Fatty products have a slight tendency to contain more salt

So to answer our original questions , do bad things come in threes?

Not to any significant degree no, when taken across all products.

The answer would probably have been quite different if we'd broken the analysis down by categories, such as cake or vegetables.

Brand type and sugar content

Major producers tend to group their products under several brands. One for customers on a budget, one for those with a bit more to spend, one for those who value organic food et cetera.

What we wonder is, does the sugar content differ between these types of brands?

Are the cheap brands also more sugary? Are ecological ones less sweet? Let's have a look!

Amount of products for different types of brand labels

Amount of Products for different types of brand labels, generated from jupyter Notebook.

While our original intent was to use the ten largest producers and divide their products into the categories we were interested in, we quickly discovered that most of the data was insufficiently labeled. As an example, the biggest producer in the dataset Auchan has 5155 products. They have sub-brands such as Le Moins Cher Mmm! and Mieux Vivre Bio. In the dataset however, only two respectively zero respectively one product contains these labels. It is likely that everything has simply been grouped under Auchan.

For that reason we ended up manually going through the list of big producers until we had found five producers that seemed to have a somewhat even spread in all four categories.

The data in the table above is the aggregate of products from migros, coop, carrefour, casino and monoprix, grouped by the type of sub-brand

Probability density for sugar content across different kinds of labels

Probability density for sugar accros labels, generated from jupyter Notebook.

What we are looking at here is an estimation of how likely a certain amount of sugar is for a certain category of food.

We can see that for all categories, values peak close to zero grams of sugar, indicating that no matter the label having a low sugar content is the norm.

Based on the graph it looks like high price products are more likely to contain more sugar, showed by the orange bump around the 40g mark. The low price products also show a slight bump around 50 grams of sugar while the normal product range tend seems to contain less sugar in general, with ecological products in between.

So what does this mean? First we reiterate the caveat that the data we're using is somewhat suspect, and that the amount of data is limited. Another consideration is what kind of products end up in each brand.

It seems plausible that the high-end products contain more luxury items. Expensive sweets, wine et cetera, which could explain the more sugary content. The normal sortiment is naturally the one most likely to contain a wide range of products and is also the one we have the most data for, reducing the effect of outliers and category bias.

CONCLUSIONS

We initially held the idea that since unhealthy food tend to be cheap, we would find that the low-cost brands contained more sugar. Give the results it is difficult to make that statement, and given the low amount of data available it is probably more true that our results reflect the *type* of product that go into each brand.

Ergo, luxury products have a higher chance of being sweets, wine and the like, while canned vegetables go into the budget brands.

Overall sugar content

While exact guidelines differ and the amount of sugar that is considered bad for you naturally depends on many factors such as age, weight et cetera, there do exist recommendations. For example, products containing more than 22.5 grams of sugar per 100g should be consumed in moderation. Let's see how many products go beyond this limit.

Sugar content across all products

Once more ensuring valid ranges for the data and removing outliers we look at the distribution of sugar for all products we find that 25 % of all products go beyond the previously stated limit, and is thus considered too sugary.

We can safely assume that the majority of sugars come from sweets and the like, but we can do a bit better than just assuming. Keeping in mind that we lose 74% of all data when using categories (as seen in data exploration), let's see how sugar content is distributed in the individual categories.

Due to the messy nature of the category labels, written in different languages with different formatting and risk of overlap, we decided to only look at the tventy-five most well populated categories in the hope that they would be the least biased.

Sugar content across categories

So the figure above represents the twenty-five most popular food category labels and the amount of products in them that are considered high-sugar. Some overlap seems to still have occured with both biscuits and biscuits and cakes appearing, as well as some mix between english and french category labels. Nevertheless we can clearly see that sweets really live up to their reputation of being sweets, with around 80 percent of all cakes being considered high-sugar, to be consumed only in limited amounts.

What a travesty.

CONCLUSIONS

If one disregard the fact that the following analysis was done on only around twenty percent of the data, that categories risk being overlapping, differently formatted (or just plain wrong), it seems that things are as one would expect. Delicous things like cake, biscuits and sweets tend to be too sweet by a lot, lending weight to the old adage "too much of a good thing... is a bad thing". Somewhat surprisingly breakfast products seem to be too sweet, but that might be attributed to culture more than anything else. Most of the data comes from France, so pain au chocolat, jam and sweet cereal might dominate here. Had more data been provided from more countries in the world we would have obtained a more nuanced picture of whether or not breakfast really does tend to be sweet.

So in summation, sweets should only be consumed in limited amounts and the french seem to like their breakfast sweet.

For those interested, the exact distributions for the different categories are plotted below

Individual category distributions

Eat less sugar,you're sweet enough already

#Data Analysis

Analysis: The First Investigation

Sugar, fat and salt

CONCLUSIONS

Brand type and sugar content

Major producers tend to group their products under several brands. One for customers on a budget, one for those with a bit more to spend, one for those who value organic food et cetera.

What we wonder is, does the sugar content differ between these types of brands?

Are the cheap brands also more sugary? Are ecological ones less sweet? Let's have a look!

CONCLUSIONS

Overall sugar content

CONCLUSIONS

Eat less sugar,
you're sweet enough already