How to evaluate the fuel efficiency of vehicles?

Jiaqiang Yi
5 min read · Jan 4, 2021

One way to evaluate a vehicle's fuel efficiency is its miles per gallon (mpg) value: the average number of miles the vehicle can travel on one gallon of fuel. However, the same vehicle may have different mpg values on the highway and on city roads. Intuitively, the structure, manufacturer, size, and other attributes of a vehicle could also influence its fuel efficiency, and all of these are recorded in the dataset.

Therefore, it is useful to have a regression model that predicts the expected miles per gallon (mpg) for different types of cars. Such a model could potentially be used in vehicle design and in the implementation of environmental standards.

The following three questions explore how different parameters influence the fuel efficiency of vehicles:

  • Question 1. Which manufacturer produces the most fuel-efficient fleet of type-1 cars?
  • Question 2. Which manufacturer produces the most fuel-efficient fleet of type-2 cars?
  • Question 3. Build a model to predict city mpg (variable “UCity” in column BG).

1. Exploratory Data Analysis (business and data understanding)

The fuel economy data is taken directly from FuelEconomy.gov, a site of the U.S. Department of Energy. It can be accessed through the following link:

The description of the vehicle dataset
RangeIndex: 40081 entries, 0 to 40080
Data columns (total 83 columns)
dtypes: bool(1), float64(32), int64(27), object(23)
memory usage: 25.1+ MB

After exploratory data analysis, the first two questions can be answered without further data processing. Fuel efficiency is evaluated here with the combined mpg. Vehicles are divided into fuel type 1 and fuel type 2:

Fuel type 1. For single-fuel vehicles, this will be the only fuel. For dual-fuel vehicles, this will be the conventional fuel.

Fuel type 2. For dual-fuel vehicles, this will be the alternative fuel (e.g. E85, Electricity, CNG, LPG). For single-fuel vehicles, this field is not used.

Question 1. Which manufacturer produces the most fuel-efficient fleet of type-1 cars?

The maximum combined mpg for type 1 is 136, and that car is manufactured by Hyundai. Therefore, by this measure, Hyundai produces the most fuel-efficient fleet of type-1 cars.

Question 2. Which manufacturer produces the most fuel-efficient fleet of type-2 cars?

Following the same method, the maximum combined mpg for type 2 is 133, and that car is manufactured by Toyota. The Hyundai vehicle is slightly more fuel-efficient than the Toyota vehicle.
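The lookup behind both answers can be sketched with pandas. This is a minimal illustration on made-up rows, not the notebook's code. The column names `make`, `comb08` (combined mpg for fuel type 1) and `combA08` (combined mpg for fuel type 2) are assumptions about the dataset, as is treating a value of 0 as "field not used". The `fleet_average` helper is a hypothetical alternative that ranks manufacturers by their mean mpg rather than by their single best vehicle.

```python
import pandas as pd

# Toy stand-in for the vehicle table. Column names are assumed and the
# values are invented purely for illustration.
vehicles = pd.DataFrame({
    "make": ["A", "A", "B", "B", "C"],
    "comb08": [30, 34, 120, 25, 40],   # combined mpg, fuel type 1 (assumed)
    "combA08": [0, 0, 0, 90, 95],      # combined mpg, fuel type 2 (assumed)
})


def most_efficient(df, mpg_col):
    """Return (make, mpg) of the single vehicle with the highest mpg."""
    used = df[df[mpg_col] > 0]  # assumption: 0 means the field is not used
    best = used.loc[used[mpg_col].idxmax()]
    return best["make"], best[mpg_col]


def fleet_average(df, mpg_col):
    """Mean mpg per manufacturer, highest first (hypothetical alternative)."""
    used = df[df[mpg_col] > 0]
    return used.groupby("make")[mpg_col].mean().sort_values(ascending=False)


type1_best = most_efficient(vehicles, "comb08")
type2_best = most_efficient(vehicles, "combA08")
type1_fleet = fleet_average(vehicles, "comb08")
print(type1_best, type2_best)
print(type1_fleet)
```

The two helpers can disagree: a manufacturer with one very efficient model may still have a modest fleet average, so it is worth stating which definition an answer uses.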

2. Build a model to predict city mpg (variable “UCity” in column BG)

The data were processed further to build the model that predicts city mpg. One major issue is null values in the data frame. If more than half of the values in a column are null, that feature is unlikely to be a useful predictor of city mpg, so such columns can be tentatively dropped from the dataset. The remaining null values in float and int columns can then be filled with the column mean, while null values in categorical columns can be kept as they are.
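The null-handling steps described above could look roughly like this in pandas. It is a sketch on a toy frame. The helper name `clean_nulls`, its `max_null_fraction` parameter, and the toy column names are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd


def clean_nulls(df, max_null_fraction=0.5):
    """Drop mostly-null columns, then mean-fill the numeric nulls."""
    null_fraction = df.isna().mean()  # share of nulls per column
    kept = df.loc[:, null_fraction <= max_null_fraction].copy()
    numeric_cols = kept.select_dtypes(include="number").columns
    kept[numeric_cols] = kept[numeric_cols].fillna(kept[numeric_cols].mean())
    return kept  # categorical nulls are left untouched


# Toy data: one numeric column with a gap, one mostly-empty column,
# and one categorical column with a gap.
toy = pd.DataFrame({
    "displ": [2.0, np.nan, 3.0, 4.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "trany": ["Auto", None, "Manual", "Auto"],
})
cleaned = clean_nulls(toy)
print(cleaned)
```

Here `mostly_empty` is dropped because three of its four values are null, the gap in `displ` is filled with the column mean, and the missing `trany` value stays null.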

The distribution of UCity is shown in the following figure; most values lie between 5 and 50. However, some UCity values are 0 or above 100. These could be outliers and need further study.

The distribution of UCity
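One simple way to pull out the suspicious rows for inspection is a boolean mask on the thresholds mentioned above (0 and 100). The values below are invented for illustration. Whether such rows are true outliers or legitimate vehicles is a separate question that the mask does not answer.

```python
import pandas as pd

# Made-up UCity values, including one zero and one very large value.
ucity = pd.Series([0.0, 18.5, 24.0, 31.2, 47.9, 180.0], name="UCity")

# Flag values that are 0 (or below) or above 100 for manual review.
suspect = (ucity <= 0) | (ucity > 100)
flagged = ucity[suspect].tolist()
print(flagged)
```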

Model Selection

Linear regression, Support Vector Machine (SVM), and decision tree models were selected for training. Regression evaluation metrics were used to evaluate model performance. All three models were trained on numeric data only, as well as on both numeric and categorical data.
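A comparison of this kind can be sketched with scikit-learn as follows. The data here is synthetic, and the train/test split, random seeds, and default hyperparameters are assumptions, since the article does not state the settings it used. The `evaluate` helper is hypothetical. It reports the same four metrics and the training time.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic numeric data standing in for the vehicle features and UCity.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)


def evaluate(model):
    """Fit the model, time the fit, and score it on the held-out split."""
    start = time.time()
    model.fit(X_train, y_train)
    elapsed = time.time() - start
    pred = model.predict(X_test)
    return {
        "r2_score": r2_score(y_test, pred),
        "explained_variance_score": explained_variance_score(y_test, pred),
        "mean_absolute_error": mean_absolute_error(y_test, pred),
        "mean_squared_error": mean_squared_error(y_test, pred),
        "train_seconds": elapsed,
    }


models = {
    "Linear regression": LinearRegression(),
    "SVM": SVR(),
    "Decision tree": DecisionTreeRegressor(random_state=0),
}
results = {name: evaluate(model) for name, model in models.items()}
for name, scores in results.items():
    print(name, scores)
```

The scores on this synthetic data say nothing about the real dataset; the sketch only shows the shape of the comparison. As a general note, `SVR` with its default RBF kernel is sensitive to feature scale, so unscaled inputs are one possible contributor to a weak SVM score, though the article does not test this.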

Models based on the numeric data (float and int)

I first built linear regression, svm.SVR, and decision tree models on the numeric data (float and int). They achieved the following scores. Linear regression and the decision tree both perform well: their r2_score and explained_variance_score are close to 1. However, SVM performs poorly (r2_score of about 0.01) and takes much longer to train (about 187 seconds).

The evaluation metrics of Linear regression:
r2_score: 0.9974470837233627
explained_variance_score: 0.9974472426894514
mean_absolute_error: 0.35476214756678076
mean_squared_error: 0.2655512812407401
The total time for the model training: 0.1297307014465332 S.
The evaluation metrics of SVM:
r2_score: 0.009908238103526701
explained_variance_score: 0.03476079425642109
mean_absolute_error: 4.833380582896554
mean_squared_error: 102.98815449750074
The total time for the model training: 186.66455221176147 S.
The evaluation metrics of decision tree:
r2_score: 0.9984326398405395
explained_variance_score: 0.9984328746718416
mean_absolute_error: 0.12323890291891106
mean_squared_error: 0.16303491905290152
The total time for the model training: 0.5630803108215332 S.

Models based on all features

To encode the categorical data, the pandas get_dummies function was used. The encoded data were then applied to the linear regression, svm.SVR, and decision tree models. The performance of linear regression became much worse, while the performance of the decision tree improved slightly. Because the size of the data increased roughly tenfold, svm.SVR did not produce a result.

The evaluation metrics of Linear Regression:
r2_score: 0.2541095003017706
explained_variance_score: 0.25436577947834293
mean_absolute_error: 0.6249547780811898
mean_squared_error: 77.58663285309869
The total time for the model training: 82.65471315383911 S.
The evaluation metrics of decision tree:
r2_score: 0.9987153910689179
explained_variance_score: 0.9987157017418185
mean_absolute_error: 0.11498840217996291
mean_squared_error: 0.13362347628237908
The total time for the model training: 11.55288028717041 S.

In addition, after applying get_dummies, the dataset grew from 83 columns to 4806 columns, as shown below. This adds a heavy burden to model training and optimization without a meaningful improvement in performance. Therefore, the categorical data can be dropped for this model training.

The description of the encoded vehicle dataset
RangeIndex: 40081 entries, 0 to 40080
Columns: 4806 entries, barrels08 to VClass_Vans, Passenger Type
dtypes: bool(1), float64(32), int64(27), uint8(4746)
memory usage: 199.5 MB
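The column growth comes from one-hot encoding: each distinct category value becomes its own indicator column. A minimal sketch with `pandas.get_dummies` on a toy frame shows the effect. The rows and the `displ` and `make` columns are made up, and the `VClass` name mirrors the column label visible in the output above.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns.
toy = pd.DataFrame({
    "displ": [2.0, 3.5, 5.0],
    "make": ["Toyota", "Hyundai", "Toyota"],
    "VClass": ["Compact Cars", "Vans, Passenger Type", "Midsize Cars"],
})

# Numeric columns pass through; each category value gets its own column.
encoded = pd.get_dummies(toy)
print(toy.shape, "->", encoded.shape)
print(list(encoded.columns))
```

Three columns become six here, because `make` has two distinct values and `VClass` has three. With tens of thousands of vehicles and high-cardinality columns such as model names, the same mechanism produces thousands of columns.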

Conclusion

The first two questions can be answered without modelling. The type of vehicle can affect its mpg. The main points about the model are summarized below:

  • Linear regression and the decision tree both achieve good accuracy and training efficiency on numeric features.
  • Numeric features alone are sufficient to train a model with an r2_score above 0.99.
  • Adding all features makes the decision tree slightly better but linear regression much worse.

The detailed notebook and future updates for the project can be accessed at this link:

https://github.com/Datajacker/fuel_efficiency_analysis

Jiaqiang Yi

I am a data scientist at a retail consulting firm in Toronto with a solid background in science and engineering.