Eat less sugar,
you're sweet enough already

cs_team_beta_hw2 - Autumn 2018

Open Food Facts


#Data Exploration

Before tackling our research questions, we first wanted to understand what Open Food Facts actually contains. Knowing how the data was collected, by whom, when and in which way gives us a better grasp of the underlying data and helps us judge how accurate it is.

To perform this Data Exploration, we decided to focus on the overall information about "Contributors", "Brands", "Countries" and "Categories".




Understanding the Data: Overall information about Contributors


Top 10 Creators

Distribution of Products by Creator, generated with Seaborn from a Jupyter Notebook.



| Creators | counts | percentage |
|---|---|---|
| kiliweb | 299'820 | 43.987547% |
| usda-ndb-import | 169'861 | 24.920848% |
| openfoodfacts-contributors | 82'719 | 12.135968% |
| date-limite-app | 17'453 | 2.560585% |
| openfood-ch-import | 11'459 | 1.681186% |
| tacite | 8'438 | 1.237966% |
| sebleouf | 8'359 | 1.226376% |
| tacinte | 5'282 | 0.774939% |
| stephane | 2'774 | 0.406982% |
| javichu | 2'766 | 0.405809% |

681'599 records have Creator, 3 records do not have any Creator (0.0004%)


More than 80% of the products were created by the top 3 contributors. The remaining individual contributors could almost be considered marginal.
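The share of the top contributors can be recomputed with a few lines of Python. This is only a minimal sketch using the standard library: the `top_shares` helper and the toy `creators` list are illustrative assumptions, not the notebook's actual code.

```python
from collections import Counter

def top_shares(values, n=3):
    """Return the n most frequent non-empty values as (name, count, percent of all records)."""
    total = len(values)
    counts = Counter(v for v in values if v)  # skip empty / missing creators
    return [(name, count, 100.0 * count / total)
            for name, count in counts.most_common(n)]

# Toy stand-in for the creator column (hypothetical values, 10 records).
creators = (["kiliweb"] * 4 + ["usda-ndb-import"] * 3
            + ["openfoodfacts-contributors"] * 2 + ["someone"])

rows = top_shares(creators, n=3)
for name, count, pct in rows:
    print(f"{name:28s} {count:3d} {pct:6.2f}%")

# Same calculation with the three largest counts from the table above.
top3_share = 100.0 * (299_820 + 169_861 + 82_719) / 681_602
print(f"top 3 creators: {top3_share:.2f}% of all records")
```

With the counts reported in the table, the three largest creators add up to roughly 81% of the 681'602 records.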

So far we can see that kiliweb is the most active contributor and is continuously adding products. usda-ndb-import, the import account for the United States Department of Agriculture data, appears to have been imported only once and no longer contributes to the Open Food Facts database. The open community of openfoodfacts-contributors seems to be growing day after day.
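Whether a contributor was imported once or keeps adding products can be read from the creation dates of its products. The sketch below counts added items per contributor and month. It assumes each record carries a creator name and a Unix creation timestamp; the contributor names and dates in `records` are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

def additions_per_month(records):
    """Count products per (creator, 'YYYY-MM') from (creator, unix_timestamp) pairs."""
    counts = Counter()
    for creator, created_t in records:
        month = datetime.fromtimestamp(created_t, tz=timezone.utc).strftime("%Y-%m")
        counts[(creator, month)] += 1
    return counts

def ts(year, month, day):
    """Helper: Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

# Hypothetical records: one bulk import and one contributor active over time.
records = [
    ("bulk-import", ts(2017, 3, 9)),
    ("bulk-import", ts(2017, 3, 9)),
    ("bulk-import", ts(2017, 3, 10)),
    ("app-user", ts(2018, 6, 1)),
    ("app-user", ts(2018, 7, 15)),
    ("app-user", ts(2018, 7, 20)),
]

monthly = additions_per_month(records)
for (creator, month), n in sorted(monthly.items()):
    print(creator, month, n)

# Number of distinct months in which each creator added products.
active_months = Counter(creator for creator, _ in monthly)
print(active_months)
```

A contributor whose products all fall into a single month looks like a one-off import, while activity spread over many months points to ongoing contributions.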

Evolution of added items per Contributor


The diversity of contributors may lead to inconsistent data. Most of the contributors are unknown to us, so the accuracy of their data cannot be checked or validated. Furthermore, most contributors simply add the products they find in their home country, which creates a risk of duplicate or near-duplicate products in the database. Based on the main contributors, we can already assume that the diversity of products per country is not representative, and that comparisons between countries are most probably not reliable.





Understanding the Data: Overall information about Brands


Top 10 Brands

Distribution of Products by Brand, generated with Seaborn from a Jupyter Notebook.



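| Brands | counts | percentage |
|---|---|---|
| Carrefour | 5'384 | 0.789904% |
| Auchan | 5'313 | 0.779487% |
| U | 4'401 | 0.645685% |
| Casino | 3'130 | 0.459212% |
| Leader Price | 2'825 | 0.414465% |
| Cora | 2'230 | 0.327170% |
| Meijer | 1'997 | 0.292986% |
| Kroger | 1'673 | 0.245451% |
| Picard | 1'517 | 0.222564% |
| Ahold | 1'370 | 0.200997% |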

There is a total of 681'602 records, and 221'177 of them have no brand (32.44958%)
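The share of records without a brand comes down to counting the empty values in the column. A minimal sketch, where the `missing_share` helper and the toy `brands` list are assumptions for illustration:

```python
def missing_share(values):
    """Return (number of empty values, percentage of all values) for one column."""
    missing = sum(1 for v in values if v is None or not str(v).strip())
    return missing, 100.0 * missing / len(values)

# Toy stand-in for the brands column (hypothetical values).
brands = ["Carrefour", "", None, "Auchan", "   ", "U"]
n_missing, pct_missing = missing_share(brands)
print(n_missing, f"{pct_missing:.1f}%")

# Cross-check of the reported figures: 32.44958% of 681'602 records.
expected_missing = round(681_602 * 32.44958 / 100)
print(expected_missing)
```

Applied to the reported total and percentage, this corresponds to roughly 221'177 records without a brand.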


32% of the products are not assigned to any brand. Furthermore, the 6 biggest brands in Open Food Facts are all French supermarket chains. It seems strange that the majority of the products are reported from France. Maybe the popularity of French cheese, like Product Code '3245390058782' Petit Munster Géromé, is the reason behind the difference? Could this be an avenue for analysis!?



List of French Cheese, by http://goatcheesesoffrance.com/the-cheeses/#chevre-regions


It's true that there is a lot of cheese in France and that some of it is really famous, as illustrated above or on Wikipedia!

There is most likely no correlation between cheese and the number of French products. The number of French products is more likely explained by the fact that the biggest contributor, kiliweb, is a French agency. This may explain why so many products are attributed to French brands and to France.





Understanding the Data: Overall information about Countries


Top 10 Largest Countries

Distribution of Products by Country, generated with Seaborn from a Jupyter Notebook.



| Countries | counts | percentage |
|---|---|---|
| France | 409'951 | 60.145217% |
| United States | 173'494 | 25.453857% |
| Switzerland | 13'543 | 1.986937% |
| Germany | 11'629 | 1.706128% |
| Spain | 6'104 | 0.895537% |
| France, Germany | 5'884 | 0.863260% |
| United Kingdom | 5'643 | 0.827903% |
| Belgium | 5'309 | 0.778900% |
| France, Switzerland | 4'345 | 0.637469% |
| Belgium, France | 3'505 | 0.514230% |

681'058 records have Countries, 544 records do not have any country (0.07%)


Looking explicitly at country instead of brand, we see the same behaviour as in the previous analysis. Most of the products are attributed to France, which is in line with the main contributor being kiliweb, a French web agency, and with the most popular brands analysed above.

Even though almost every product is assigned to a country (only 544 products are not linked to any), we find 1361 different "countries".

So the countries field apparently holds a list of countries per product rather than a single, unique country.
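The gap between raw values and real countries shows up as soon as the comma-separated lists are split. A minimal sketch, assuming the field is a comma-separated string as in the table above (the sample values are illustrative):

```python
from collections import Counter

def count_individual_countries(raw_values):
    """Split comma-separated country lists and count products per single country."""
    counts = Counter()
    for raw in raw_values:
        if not raw:
            continue  # product without any country
        # A set ensures a product listing the same country twice counts once.
        names = {part.strip() for part in raw.split(",") if part.strip()}
        counts.update(names)
    return counts

# Toy stand-in for the countries column (hypothetical values).
raw = ["France", "France, Germany", "Germany, France", "Belgium, France",
       "United States", "", "France, Switzerland"]

counts = count_individual_countries(raw)
distinct_raw = len({value for value in raw if value})
print(distinct_raw, "distinct raw values")
print(len(counts), "distinct countries")
print(counts.most_common(2))
```

In this toy sample, 6 distinct raw values collapse to 5 actual countries, because "France, Germany" and "Germany, France" are the same combination written in a different order.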

Furthermore, most of the products are linked to only one or two countries, which is very suspicious.

For example, we could assume that this peanut butter, assigned to France, can also be found in another country such as the USA, since it is made by Kroger.

Most probably this product, which is sold in France, was catalogued as a French product by a French contributor. But we can be fairly sure that this peanut butter is sold in the USA too, maybe under a different product name.






Understanding the Data: Distribution of Products by Countries


Distribution of Products by Country

Distribution of Products by Country, generated with Folium from a Jupyter Notebook.


Even though the country information isn't well structured, we went ahead with some cleaning.

For this purpose, we created a country map with Folium and performed the following steps:

  • Load a file containing a list of countries and their ISO three-letter codes
  • Clean/parse the country names from Open Food Facts
  • Map each normalized country to a proper country name
  • Show a Folium map, join the values through the ISO three-letter codes and retrieve the percentage accordingly
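The first three steps, plus the percentage calculation, can be sketched with the standard library alone. Everything named here is an assumption for illustration: the inline `ISO_CSV` excerpt stands in for the country file, the `ALIASES` table and the language-prefix handling are hypothetical cleaning rules, and the map rendering itself is left out.

```python
import csv
import io
from collections import Counter

# Stand-in for the country/ISO file (hypothetical excerpt, real ISO 3166-1 alpha-3 codes).
ISO_CSV = """name,alpha3
France,FRA
United States,USA
Switzerland,CHE
Germany,DEU
"""

# Hypothetical spelling variants mapped to the names used in the ISO file.
ALIASES = {"us": "united states", "usa": "united states",
           "deutschland": "germany", "suisse": "switzerland"}

def load_iso_codes(text):
    """Step 1: load country names with their ISO three-letter codes."""
    return {row["name"].lower(): row["alpha3"]
            for row in csv.DictReader(io.StringIO(text))}

def normalize(name):
    """Steps 2 and 3: clean one raw country name and map it to a proper name."""
    key = name.strip().lower()
    if ":" in key:  # drop a language prefix such as 'en:'
        key = key.split(":", 1)[1]
    key = key.replace("-", " ")
    return ALIASES.get(key, key)

def share_by_iso3(raw_values, iso_codes):
    """Percentage of products per ISO code; unknown names are ignored."""
    counts = Counter()
    for raw in raw_values:
        codes = {iso_codes.get(normalize(part))
                 for part in raw.split(",") if part.strip()}
        counts.update(code for code in codes if code)
    total = len(raw_values)
    return {code: 100.0 * n / total for code, n in counts.items()}

# Toy stand-in for the countries column (hypothetical values).
raw = ["France", "en:france, Deutschland", "USA", "Suisse, France", "Atlantis"]
shares = share_by_iso3(raw, load_iso_codes(ISO_CSV))
print(shares)
```

The resulting dictionary, keyed by ISO three-letter code, is the kind of input a choropleth layer can be joined on. Because a product listed in several countries counts once for each of them, the percentages can add up to more than 100%.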


Looking at the map, it's clear that data has been submitted for only a few countries. Based on this, we can say that most of the products are assigned to

  • France with 409'951 products (~60%)
  • United States of America with 173'494 products (~25%)
  • Switzerland with 13'543 products (~1.98%)
  • Germany with 11'629 products (~1.70%)
  • All other countries account for less than 1% each

So far we can conclude that the 'country' field has to be used with caution.





Understanding the Data: Overall information about Categories


Top 10 Largest Categories

Distribution of Products by Categories, generated with Seaborn from a Jupyter Notebook.



| Categories | counts | percentage |
|---|---|---|
| Plant-based foods and beverages | 37'312 | 5.474162% |
| Beverages | 25'368 | 3.721820% |
| Sugary snacks | 23'588 | 3.460671% |
| Dairies | 15'580 | 2.285791% |
| Meats | 9'589 | 1.406833% |
| Groceries | 9'390 | 1.377637% |
| Meats | 7'669 | 1.125143% |
| Spreads | 4'355 | 0.638936% |
| Frozen foods | 3'093 | 0.453784% |
| Fruit juices | 3'065 | 0.449676% |

173'722 records have Categories, 507'880 records do not have any Category (74%)


It seems that 74% of the products (507'880) do not have any 'category'. Furthermore, the first and second categories both contain 'Beverages', which suggests that the categories overlap.

So far we can conclude that the 'category' field has to be used with caution.




