Overview
The UK has a ban on the sale of new petrol and diesel cars that is set to take effect in 2030, as well as strict emissions standards. One such standard is a tax that scales with exhaust emissions, so higher emissions mean higher costs for the customer (BuyACar Team, 2021). This makes it important for car manufacturers to be able to predict the emissions of their vehicles so they can reduce costs for their customers. Our model is meant to help manufacturers see which vehicle characteristics to improve in order to reduce emissions. The project team consisted of Hakeem Mohamed, Elliott Sax, and me. Using vehicle data from the European Union (EU), we built a machine learning model to determine which vehicle characteristics have the most impact on carbon dioxide (CO2) emissions and which of them are associated with higher emissions.
Some key insights include:
Using 8 variables in a ridge regression model gave us the best R-squared score of 77%.
Fuel type had the largest positive regression coefficient, 6.21, making it the variable most strongly associated with higher emissions in our model.
Mass had a much smaller positive coefficient than we hypothesized, at only 0.029.
The Data
The dataset can be found on Kaggle at the following URL: https://www.kaggle.com/datasets/vivovinco/monitoring-of-co2-emissions-from-passenger-cars . The original dataset contains 33 columns and over 1 million rows. The abbreviations used in the dataset are listed below.
ID : Identification number
MS : Member state
Mp : Manufacturer pooling
VFN : Vehicle family identification number
Mh : Manufacturer name EU standard denomination
Man : Manufacturer name OEM declaration
MMS : Manufacturer name MS registry denomination
TAN : Type approval number
T : Type
Va : Variant
Ve : Version
Mk : Make
Cn : Commercial name
Ct : Category of the vehicle type approved
Cr : Category of the vehicle registered
m (kg) : Mass in running order complete vehicle
Mt : WLTP test mass
Enedc (g/km) : Specific CO2 Emissions (NEDC)
Ewltp (g/km) : Specific CO2 Emissions (WLTP)
W (mm) : Wheel Base
At1 (mm) : Axle width steering axle
At2 (mm) : Axle width other axle
Ft : Fuel type
Fm : Fuel mode
ec (cm3) : Engine capacity
ep (KW) : Engine power
z (Wh/km) : Electric energy consumption
IT : Innovative technology or group of innovative technologies
Ernedc (g/km) : Emissions reduction through innovative technologies
Erwltp (g/km) : Emissions reduction through innovative technologies (WLTP)
De : Deviation factor
Vf : Verification factor
r : Total new registrations
Analysis
We started by cleaning the data and removing unnecessary columns. We also removed all electric vehicles from the dataset, because they do not produce exhaust CO2 emissions and we were specifically interested in how CO2 emissions relate to the other variables. This left us with over 12,000 rows of data.
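A minimal sketch of this kind of cleaning step, assuming pandas. The sample rows are made up, and the column selection, the "electric" label in the Ft column, and the decision to drop rows with missing values are all assumptions for illustration rather than a record of exactly what we did.

```python
import pandas as pd

# Tiny made-up sample that reuses the dataset's column abbreviations.
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Mk": ["A", "B", "C", "D"],
    "Ft": ["petrol", "electric", "diesel", "petrol"],
    "m (kg)": [1200, 1600, 1500, None],
    "Ewltp (g/km)": [120.0, 0.0, 135.0, 110.0],
    "r": [1, 1, 1, 1],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the columns of interest (hypothetical selection).
    keep = ["Ft", "m (kg)", "Ewltp (g/km)"]
    out = df[keep].copy()
    # Remove electric vehicles (the "electric" label is an assumption).
    out = out[out["Ft"].str.lower() != "electric"]
    # Assumption: rows with missing values in the kept columns are dropped.
    return out.dropna().reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

On the sample above, the electric row and the row with a missing mass are removed, leaving one petrol and one diesel vehicle.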
Next, we created scatter plots of each variable against CO2 emissions.
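Plots like these could be produced roughly as follows. This is only a sketch, assuming matplotlib and NumPy; the data is synthetic, the three columns are an arbitrary subset, and the output file name co2_scatter.png is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for three numeric columns (not the real data).
features = {
    "m (kg)": rng.normal(1500, 250, n),
    "ec (cm3)": rng.normal(1600, 400, n),
    "ep (KW)": rng.normal(95, 30, n),
}
# Made-up CO2 values that depend on mass and engine capacity plus noise.
co2 = 0.03 * features["m (kg)"] + 0.02 * features["ec (cm3)"] + rng.normal(0, 8, n)

# One panel per variable, all sharing the CO2 axis.
fig, axes = plt.subplots(1, len(features), figsize=(12, 3.5), sharey=True)
for ax, (name, values) in zip(axes, features.items()):
    ax.scatter(values, co2, s=8, alpha=0.5)
    ax.set_xlabel(name)
axes[0].set_ylabel("Ewltp (g/km)")
fig.tight_layout()
fig.savefig("co2_scatter.png")  # hypothetical output file name
```

Sharing the y-axis across panels makes it easier to compare how strongly each variable tracks emissions.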
Using all eight variables that showed a relationship with CO2 emissions gave us an R-squared score of 77% with a ridge regression model. The variables were: fuel type, fuel mode, mass, wheelbase, steering axle, other axle, engine capacity, and engine power. At the start of the project, we set a target R-squared score of 80%, which we did not reach. Fuel type had the largest coefficient, at 6.21. We tried different combinations of variables, but every combination that left out a variable scored lower. The code used is pictured below. Please note: the R-squared score pictured is not our highest, because we retrained and tested the model multiple times.
Initially, we tested the model with the variables that showed a positive relationship with CO2: mass, wheelbase, steering axle, other axle, engine capacity, and engine power. We left out the make and category variables because they did not appear to affect emissions. When these six variables did not reach the 80% R-squared target, we added fuel type and fuel mode. The resulting eight variables gave us the highest R-squared score of all the models we tested, 77%.
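For readers who want to see the general shape of such a model, here is a minimal sketch of a ridge regression on eight features, assuming scikit-learn. The data is synthetic, the integer codes for fuel type and fuel mode are an assumed encoding, and the coefficients used to generate the data are arbitrary, so the printed numbers will not match ours.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
names = ["fuel type", "fuel mode", "mass", "wheelbase",
         "steering axle", "other axle", "engine capacity", "engine power"]

# Synthetic feature matrix; fuel type and fuel mode are integer codes
# (an assumed encoding, used here only for illustration).
X = np.column_stack([
    rng.integers(0, 4, n),       # fuel type code
    rng.integers(0, 3, n),       # fuel mode code
    rng.normal(1500, 250, n),    # mass (kg)
    rng.normal(2650, 120, n),    # wheelbase (mm)
    rng.normal(1550, 50, n),     # steering axle width (mm)
    rng.normal(1550, 50, n),     # other axle width (mm)
    rng.normal(1600, 400, n),    # engine capacity (cm3)
    rng.normal(95, 30, n),       # engine power (kW)
])
# Arbitrary coefficients used only to generate a made-up target.
true_coef = np.array([6.0, 2.0, 0.03, 0.005, 0.0, 0.0, 0.02, 0.2])
y = X @ true_coef + rng.normal(0, 10, n)

# Hold out 20% of the rows for scoring.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = Ridge(alpha=1.0).fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))

print(f"R-squared on held-out data: {r2:.2f}")
for name, coef in zip(names, model.coef_):
    print(f"{name:>16}: {coef:.3f}")
```

Fixing random_state makes the split, and therefore the score, reproducible. Without it, each retraining uses a different split and the R-squared score shifts from run to run.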
Besides the ridge regression model, we tested multiple linear regression, simple (single-variable) linear regression, and lasso regression. See below for results.
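A comparison along these lines could be set up as in the sketch below, assuming scikit-learn. The data is synthetic and the alpha values are arbitrary, so the scores it prints are not the project's results.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 800

# Synthetic data: eight standardized features and a noisy linear target.
X = rng.normal(size=(n, 8))
weights = np.array([3.0, 1.0, 2.0, 0.5, 0.0, 0.0, 1.5, 2.5])
y = X @ weights + rng.normal(0, 2.0, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

all_cols = list(range(8))
# Each entry pairs a model with the feature columns it is allowed to use.
candidates = {
    "single linear": (LinearRegression(), [0]),
    "multiple linear": (LinearRegression(), all_cols),
    "ridge": (Ridge(alpha=1.0), all_cols),
    "lasso": (Lasso(alpha=0.1), all_cols),
}

scores = {}
for label, (model, cols) in candidates.items():
    model.fit(X_train[:, cols], y_train)
    # For regressors, .score returns R-squared on the given data.
    scores[label] = model.score(X_test[:, cols], y_test)
    print(f"{label:>16}: R-squared = {scores[label]:.3f}")
```

Scoring every model on the same held-out rows keeps the comparison fair, since differences in R-squared then come from the models and not from the split.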
To gain further insight, we created several visualizations, which can be seen below.
Our full dashboard and story can be found on Tableau Public here: Story | Tableau Public.
Final Thoughts
As might be expected, petrol (unleaded fuel) had the highest CO2 emissions, followed by diesel, liquefied petroleum gas (LPG), and hybrid. Cadillac and Chevrolet had the highest and lowest CO2 emissions, respectively. However, on further review, we determined that both were outliers, because few vehicles of these makes are sold in the EU or appear in this dataset.
Automobile manufacturers in the EU should focus on hybrid and electric vehicles to achieve the lowest emissions and save their customers the most money. While they work on expanding their electric and hybrid inventories, they can also improve other characteristics of their petrol vehicles that may lower emissions.
For further information and code, please visit our project GitHub page: https://github.com/two-suns/CO2_Emissions_Estimator . Please contact me with any questions or concerns. If you liked this, please connect with me on LinkedIn. Thank you.