Data Visualization with Python
Final project for Coursera’s Data Visualization with Python course.
Data visualization is the practice of representing data in a visual context so that trends and correlations between the features in a dataset can be clearly seen. Python offers excellent libraries with a wide range of features for visualizing data.
The Data Visualization with Python course on Coursera covers the following concepts:
- Data visualization and some of the best practices to keep in mind when creating plots and visuals.
- The architecture of Matplotlib.
- Basic plotting with Matplotlib.
- How to read csv files into a pandas dataframe and process and manipulate the data in the dataframe.
- How to generate line plots using Matplotlib.
The final assessment for the course is based on a survey conducted to gauge an audience’s interest in different data science topics, namely:
- Big Data (Spark / Hadoop)
- Data Analysis / Statistics
- Data Journalism
- Data Visualization
- Deep Learning
- Machine Learning
The participants had three options for each topic: Very interested, Somewhat interested, and Not interested. In total, 2,233 respondents completed the survey.
Here is the link to the survey result.
Now let’s get started with the questions!
The first question asks us to use the pandas read_csv method to read the CSV file into a pandas dataframe. Before that, we import the libraries needed for visualization in this project.
Starting with the first question, we read the CSV file into a dataframe.
The output of data.head(), which shows the first 5 rows, is shown below:
Note: We use the index_col parameter in order to load the first column as the index of the dataframe.
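The loading step described above can be sketched roughly as follows. This is a minimal sketch, not the exact code from the assignment: the survey file itself is not reproduced here, so an in-memory buffer with made-up placeholder rows stands in for it, and the column labels are assumed from the three answer options listed earlier.

```python
import io

import pandas as pd

# Placeholder rows standing in for the survey file. The counts are made up
# for illustration and are NOT the real survey results.
csv_text = """,Very interested,Somewhat interested,Not interested
Data Visualization,100,50,10
Machine Learning,120,30,5
"""

# index_col=0 loads the first column (the topic names) as the row index.
# With the real file, pass its path or URL instead of the StringIO buffer.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# head() returns at most the first five rows (only two exist here).
print(data.head())
```

Without index_col=0, the topic names would end up in an ordinary column and the rows would be indexed 0, 1, 2, and so on.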
The next question asks us to use Matplotlib to visualize the percentage of respondents’ interest in each of the data science topics surveyed.
Before drawing the chart, we convert the counts into percentages of the total number of respondents. Recall that 2,233 respondents completed the survey.
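One way this conversion might look is sketched below. The counts and the topic names ("Topic A", "Topic B") are made-up placeholders, not the survey results. Only the total of 2,233 comes from the survey description.

```python
import pandas as pd

TOTAL_RESPONDENTS = 2233  # number of respondents who completed the survey

# Placeholder counts (hypothetical, for illustration only).
data = pd.DataFrame(
    {
        "Very interested": [1000, 1500],
        "Somewhat interested": [800, 500],
        "Not interested": [200, 100],
    },
    index=["Topic A", "Topic B"],
)

# Divide every count by the total and scale to a percentage,
# rounded to two decimal places.
percentages = (data / TOTAL_RESPONDENTS * 100).round(2)

print(percentages)
```

Dividing the whole dataframe by a scalar applies the operation to every cell at once, so no explicit loop over topics or columns is needed.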
The output is shown below:
Now we can create the bar chart:
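A minimal sketch of such a chart, assuming a dataframe of percentages like the one described above. The values here are placeholders, and the figure size, title, and output file name are my own choices rather than requirements from the assignment. The sketch uses the pandas plot method, which draws with Matplotlib under the hood.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder percentages (hypothetical values, for illustration only).
percentages = pd.DataFrame(
    {
        "Very interested": [44.78, 67.17],
        "Somewhat interested": [35.83, 22.39],
        "Not interested": [8.96, 4.48],
    },
    index=["Topic A", "Topic B"],
)

# One group of bars per topic, one bar per answer option.
ax = percentages.plot(kind="bar", figsize=(12, 6), width=0.8)
ax.set_title("Percentage of respondents' interest in data science topics")
ax.set_ylabel("Percentage of respondents")

# Write each bar's value just above it.
for patch in ax.patches:
    x_center = patch.get_x() + patch.get_width() / 2
    ax.annotate(
        f"{patch.get_height():.2f}%",
        (x_center, patch.get_height()),
        ha="center",
        va="bottom",
    )

plt.tight_layout()
ax.get_figure().savefig("survey_interest.png")  # hypothetical output file name
```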
We’ll be using a different dataset for the questions that follow. This is a San Francisco crime dataset that mainly shows the neighborhoods in San Francisco along with the total number of crime cases in each neighborhood. The dataset can be downloaded here.
The first question here asks us to convert the San Francisco dataset into a pandas dataframe that represents the total number of crimes in each neighborhood.
The last question for this project requires a choropleth map to visualize crime in San Francisco.
To create the choropleth map, a GeoJSON file that marks the boundaries of the different neighborhoods in San Francisco is provided. The code for the choropleth map is shown below:
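If the raw file lists one row per incident rather than ready-made totals (an assumption on my part), the totals per neighborhood could be computed along these lines. The column names Neighborhood, Category, and Count are hypothetical, and the rows are made-up placeholders. The real file may use different names.

```python
import pandas as pd

# Placeholder incident records (hypothetical, for illustration only).
incidents = pd.DataFrame(
    {
        "Neighborhood": ["MISSION", "CENTRAL", "MISSION", "RICHMOND", "MISSION", "CENTRAL"],
        "Category": ["THEFT", "ASSAULT", "THEFT", "FRAUD", "ROBBERY", "THEFT"],
    }
)

# Count the rows in each neighborhood, turn the result back into a
# two-column dataframe, and list the busiest neighborhoods first.
crime_counts = (
    incidents.groupby("Neighborhood")
    .size()
    .reset_index(name="Count")
    .sort_values("Count", ascending=False, ignore_index=True)
)

print(crime_counts)
```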
This post runs through the questions in Coursera’s Data Visualization with Python course.
I hope this is clear and helpful to someone 🙂.