A sneak peek at essential aspects of Pandas, Matplotlib & Seaborn

aditya goel
29 min read · Apr 11, 2021

First step to Pandas :- Anaconda comes with a lot of the packages that you need for data science. In this article, we will also use Conda, a package manager. If you already have a version of pip or Conda installed, you don’t need to reinstall everything. To complete a brand new installation of Anaconda, head over to www.continuum.io/downloads. Once the installation is done, it can be opened through the explorer :-

Starting the Jupyter-Notebook :-

Once we open the Jupyter-Notebook, we can code in it using pandas. Here is how you can see the version of the Pandas library :-
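For reference, a minimal sketch of this check, run in a notebook cell :-

import pandas as pd

print(pd.__version__)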

We can also get the doc-string for any of the methods available in the pandas library. For example, below we can see the Docstring for the ‘read_csv’ method :-
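A quick sketch of how the docstring can be pulled up (the ? form is Jupyter-specific) :-

help(pd.read_csv)    # standard Python help
# pd.read_csv?       # Jupyter shortcut: opens the docstring pane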

Pandas DataFrames :-

A DataFrame is like a 2-dimensional array. It can be considered like a table or a spreadsheet with rows and columns. Here is a simple example of a data-frame, which has pre-defined column headings. Each row corresponds to a medal that an athlete has won.

Taking Input into DataFrame :-

With Pandas, we can read data stored in a wide variety of formats, such as Excel, JSON, or SQL database tables, amongst others. Since our file is a CSV file, we will use read_csv, which allows us to read a comma-separated file into a DataFrame. read_csv has numerous parameters and is very feature-rich. We can specify a wide variety of options, including what column headings to use for series, additional delimiters for files, and ways to parse any data fields. The only required parameter to read_csv is the path to the CSV file. If the first few lines of the file do not hold any relevant data, the skiprows parameter allows you to skip them.

olympicsDataFrame = pd.read_csv('/Users/b0218162/Documents/LEARNING/PYTHON/Pandas-Essential-Training/data/olympics.csv', skiprows=4)

Viewing the DataFrame :-

The head and the tail return the first and last N rows of a DataFrame. This is one of the first steps after reading a file into a Pandas DataFrame just to get a feel of the rows and columns involved. By default, we will get the first five rows and the last five rows if we don’t specify any value of N.

A helpful tip is that sometimes, if you have sorted your DataFrame or Series, you might only want to display the top three results. You can do that by typing olympicsDataFrame.head(3), which will give us the top three rows.

Another example of where we might want to use the head method is when we have just read a CSV file into our DataFrame and want to view it. I would normally run head straight after reading the CSV file into the DataFrame, so I can check very quickly that the DataFrame is in the format that I would expect. Note that ‘olympicsDataFrame’ is the name of the pandas data-frame we have used here, and we use its ‘head()’ method to view the first few rows.
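A minimal sketch of these calls, assuming olympicsDataFrame has been created with read_csv as shown above :-

olympicsDataFrame.head()     # first 5 rows by default
olympicsDataFrame.head(3)    # only the top 3 rows
olympicsDataFrame.tail()     # last 5 rows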

In order to see the entire data-frame, we can simply use the name of the data-frame. This will show the first few and last few rows of the data-frame. Below we can see that there are, in total, 29,216 rows in the data-frame.

Pandas Series :-

A Series is a one-dimensional array of indexed data. One of the huge benefits of Pandas is that it supports both integer and label-based indexing, and provides a host of methods for performing operations involving the index. The first column is known as an index. And, each of the other columns corresponds to a Series. Example → The first Series is called City, where the Olympics were held. The second Series is called the Edition.

Please note that, each of the rows is also a Series, and this can be identified by this index number.

We can access a Series using square-bracket notation, with either single or double quotes, or using dot notation. The square-bracket notation for accessing a Series will always work. The dot notation can be a shortcut, but it will not work if there is a space in the Series name. So, if you want to use the dot notation, you need to make certain that the column name does not contain a space. You can also access multiple Series by passing the Series names as a list.

Example → Let’s try to access a list of Series. We shall be accessing the City & Edition Series using double square-brackets. NOTE: Press ‘Shift + Enter’ to execute the cell.

The other way of accessing the City information is to use the dot notation, olympicsDataFrame.City, and Shift + Enter to run that cell. You can see that we get exactly the same output as earlier. Remember that the dot notation can only be used when there are no spaces in the column name.
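To summarise the access patterns described above, here is a small sketch (column names as used in this data-set) :-

olympicsDataFrame['City']               # square-bracket notation: always works
olympicsDataFrame.City                  # dot notation: only if the name has no spaces
olympicsDataFrame[['City', 'Edition']]  # multiple Series, passed as a list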

Next, it’s very important that we understand which object we are working with. We know that olympicsDataFrame is a data-frame, but let’s confirm that by using the type command. We know that olympicsDataFrame.City is a Series, and we can confirm that using type as well. What might not have been so obvious is that if we select a couple of Series, such as City, Edition, and Athlete, the type of that selection is also a data-frame.
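A small sketch of these type checks :-

type(olympicsDataFrame)                                  # DataFrame
type(olympicsDataFrame.City)                             # Series
type(olympicsDataFrame[['City', 'Edition', 'Athlete']])  # DataFrame again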

Validating the Input-Data using the ‘shape’ attribute :-

The shape attribute returns a tuple of (rows, columns), representing the dimensionality of the DataFrame. It’s a good idea to check this against the original CSV file to ensure that all the expected data has been read into the DataFrame.

So, the 29,216 corresponds to the number of rows in the CSV file and the 10 corresponds to the number of columns. We can confirm that: City, Edition, Sport and so on are the 10 columns, and we indeed have 29,216 rows of data in our CSV file. One helpful tip is that sometimes you might only want the number of rows or columns as part of your code; we can extract this from the tuple by indexing shape with a number in square-brackets. Index ‘0’ gives the number of rows and index ‘1’ gives the number of columns of the data-frame.
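A minimal sketch of the shape attribute :-

olympicsDataFrame.shape       # (29216, 10) -> (rows, columns)
olympicsDataFrame.shape[0]    # 29216 -> number of rows
olympicsDataFrame.shape[1]    # 10    -> number of columns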

Viewing the Summary of the DataSet with ‘info()’ method :-

info() provides a summary of the data frame, including the number of entries, the data type, and the number of non-null entries for each series in the data frame. This is important because, when working with a real data set, there is often some data missing, and we want a view of this to determine how we will handle the missing data. In our case, we can see that there is no missing data in this data set: there are 29,216 entries, and every one of them is non-null.

What is also useful about the info method is that it gives us the data types of the different series or columns. This is useful when we have to do any comparisons or queries. For example, if we are querying a string column, we know that we have to use quotes, and if we are comparing against an integer or a float value, we won’t need those quotes.
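A small sketch: info() itself, plus two queries showing how the data types drive the use of quotes (exact dtypes depend on the CSV) :-

olympicsDataFrame.info()

olympicsDataFrame[olympicsDataFrame.Medal == 'Gold']    # string column -> quotes needed
olympicsDataFrame[olympicsDataFrame.Edition >= 1984]    # integer column -> no quotes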

Analysing the DataSet with ‘value_counts()’ method :-

‘value_counts()’ is one of the most useful methods in pandas. It returns a Series object counting all the unique values. There are two things in particular to be aware of with value_counts. First, as it returns a count of the unique values, the first value is the most frequently occurring element, the second is the second most frequently occurring element, and so on. This order can be reversed by setting the ascending flag to True. Second, dropna, one of the parameters of value_counts, is True by default, so you will not get a count of the NA values; the NA values, remember, are the missing data values. If your data set has a significant number of NA values, this can be misleading, and you can turn this behaviour off by setting dropna to False. NOTE: Remember that value_counts gives you the unique values for that series.

Here, we are showing value_counts() for the entire data-set :-

Example of ‘value_counts’ with a specific series → If we wanted to know how many medals were presented in each year the Olympics were held, we can use the command below. According to this analysis, the most medals were presented in the 2008 games and, for whatever reason, there were more medals presented in the 2000 games than in the 2004 games. Please note that the query below sorts the records by values AND not by the index :-
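A sketch of that command :-

olympicsDataFrame.Edition.value_counts()    # medals per edition, most frequent first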

Another example → Let’s say I was interested to know, over the history of the Olympics (that’s from 1896 to 2008), how many medals were presented to men and how many to women; I can get that information from the Gender field. I can see that there were 21,721 medals presented to men over the history of the Olympics and 7,495 to women.

Another example → We might want to sort in ascending order instead. All we need to do is change the ascending flag to True, and we can see that this has swapped those values around. By default, you will not get a count of the NA values; remember, those are the missing data values, and if your data-set has a significant number of them, the result can be misleading. To include them, we just change dropna to False. We will not see any difference in our data set, as we don’t have any missing data.
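A sketch of both flags together (dropna=False makes no difference here, since this data-set has no missing values) :-

olympicsDataFrame.Gender.value_counts(ascending=True, dropna=False)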

Using the ‘sort_values()’ method on the data-set :-

sort_values() sorts the values in a Series. As axis is equal to zero, you are sorting along the column, in ascending order by default. So, if you visualise a Series as a single column, you are sorting the contents of that column in ascending order. By default, the NaNs, or missing data, are put right at the end. sort_values(), when used with a DataFrame, is particularly useful, as you can sort multiple series in ascending and descending order.

Example → Let’s sort by the athletes’ names. So olympicsDataFrame is the name of the DataFrame, Sorting on ‘Athlete’ provides us the list of all of the athletes’ names sorted by the name of the athlete.

Another Example → We can also sort by the edition of the Olympics, and the athletes’ names. Here, we are using multiple-series and we can enter them as a list. So we sort by Edition, and Athlete.

This command sorts first on the basis of ‘Edition’, and then, within each edition of the Olympics, the athletes’ names are sorted all the way from A to Z; and that runs from the 1896 Olympics all the way to the 2008 Olympics. So, not surprisingly, in the 2008 Olympics, towards the end of our DataFrame, we have athletes with names such as Zubari and Zueva. Looking at that one line of code: we sort by Edition first and then by the athletes’ names, so each Edition section is sorted by athlete.
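A minimal sketch of the sorting calls described above :-

olympicsDataFrame.sort_values('Athlete')                  # single Series
olympicsDataFrame.sort_values(['Edition', 'Athlete'])     # multiple Series, as a list

# ascending can also be a list, one flag per column:
olympicsDataFrame.sort_values(['Edition', 'Athlete'], ascending=[False, True])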

Querying with Boolean Indexing on data-set :-

Boolean vectors, or conditions, can be used to filter data. Based on a condition, we pass a series of True and False values to a Series or DataFrame to select and display the rows where the series has True values. Instead of using and, or, and not, as in most programming languages, you use the symbols & (and), | (or) and ~ (not). Remember that if you have more than one condition, or Boolean vector, each must be grouped in brackets or parentheses. This is to ensure that the order of operations is carried out correctly.

Example → Let’s say we wanted to select all of the athletes who have won a gold medal. If I run the condition on its own, I get a whole series of Trues and Falses. The first True corresponds to the first record, where the athlete has won a gold medal. In the second record the athlete has won a silver medal, and that’s why you see a False. In the third the athlete has won a bronze medal, so it’s also a False, and so on.

Another example → If we wanted to actually evaluate this and determine which athletes have won gold medals, we enter this within square brackets. And we can see that we’ve now got a data frame where all the athletes have won a gold medal. And we can check this by looking at that final column here, where all the athletes have won a gold medal.

Another example → Let’s look at another example of how we can use Boolean indexing, this time with multiple conditions. Say we want to know all women athletes who have won a gold medal. Remember that we need to include each condition in parentheses. So: olympicsDataFrame, which is the name of our data frame, Gender == ‘Women’, combined with Medal == ‘Gold’. When we run this cell, we see that we’ve got a data frame with all women, and the final column here is Gold.
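A sketch of the Boolean-indexing steps above (Medal and Gender values as they appear in this data-set) :-

olympicsDataFrame.Medal == 'Gold'                        # a Series of True/False values

olympicsDataFrame[olympicsDataFrame.Medal == 'Gold']     # only the gold-medal rows

# multiple conditions: each one wrapped in parentheses
olympicsDataFrame[(olympicsDataFrame.Medal == 'Gold') & (olympicsDataFrame.Gender == 'Women')]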

String handling in Pandas :-

The string-handling methods generally have names matching the equivalent scalar built-in string methods that are available in Python. They are available under the str attribute. Using the str attribute, you have access to several common string methods, such as contains, startswith, isnumeric and so on.

Example → Let’s say we want to look up the legendary Flo Jo, the fastest woman ever. Her time for the 100-meter sprint from 1988 is still the fastest time recorded today. Say you can’t remember her full name and only know that her first name is Florence. So I go to my data frame, select Athlete, use the str attribute with str.contains, and search for Florence.
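A sketch of that lookup :-

olympicsDataFrame[olympicsDataFrame.Athlete.str.contains('Florence')]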

So, Florence Syers and so on. Clearly the one that we’re looking for is Flo Jo from the USA who ran the 100 meters and 200 meters in 1988. So her full name is Florence Griffith Joyner. That’s the end of our basic analysis.

Other examples of analysis with Pandas :- In below example, we can see that using a combination of sort_values() and value_counts() can help us answer some fairly interesting questions about our data.

Example #1.)

Example #2.)

For the above result, below is the query we used. It filters the records with Gender as ‘Men’ AND Medal as ‘Gold’ AND Sport as ‘Badminton’.

dfForMGB = olympicsDataFrame[(olympicsDataFrame.Gender == 'Men') & (olympicsDataFrame.Medal == 'Gold') & (olympicsDataFrame.Sport.str.contains('Badminton'))]

Example #3.)

Example #4.)

Matplotlib Library :-

Matplotlib tries to make easy things easy and hard things possible; that’s what it says on its website. You can generate plots, histograms, bar charts and error charts with just a few lines of code. For simple plotting, the pyplot module provides a MATLAB-like interface. The line “import matplotlib.pyplot as plt” allows you to use the matplotlib.pyplot module via the abbreviation plt. One of the features of the IPython kernel is the ability to display plots that are the output of running code cells. The IPython kernel works seamlessly with the Matplotlib plotting library to provide this functionality. To set this up, you must execute the second line, %matplotlib inline, which is what’s known as a magic command.

With the Matplotlib inline backend, the output of the plotting commands is displayed inline, within the Jupyter Notebook, directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document.
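The standard notebook setup, for reference :-

import matplotlib.pyplot as plt
%matplotlib inline    # notebook magic: show plots inline, below the code cell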

Types of plot :-

The plot command displays the graph and there are a couple of options. By default, the graph is a line-plot, but you can also specify that you want to use another type of graph such as a bar graph or a pie chart. You need to use the kind equals parameter to specify what type of plot you require. Half of the skill is in trying to determine which graph would be the best to represent the data.

  • We normally use a line-graph when we are trying to track changes over a short or a long period of time. When smaller changes exist, line graphs are better to use than, say, a bar graph, and line graphs can also be used to compare changes over the same period of time for more than one group.
  • We normally use a bar-graph to compare things between different groups or to track changes over time. However, when trying to measure change over time, bar graphs are best when the changes are large. A horizontal bar graph is almost identical to a bar graph, but with its axes flipped over; the bar-graph and the horizontal-bar-graph have the same use cases.
  • Pie charts are best to use when you’re trying to compare parts of a whole. They do not show changes over time.

Example 1 → What were the different sports in the first Olympics? Plot them using different graphs.

Lets first find the different sports in the first Olympics :-

We can then chain the plot command to this and remember, by default, we will get a line plot. To confirm that, I’m going to do kind equals line and we get the same plot.

If we want to suppress this matplotlib.axes here, I can add a semicolon to the end of that line and run the cell again.

As we can see above, the labels overlap because there is too little spacing, so we can also specify the graph’s size. Here, the width has been set to 10 and the height to 3.

Moving onto a bar plot, given that we are going to be using the same line, I’m just going to copy this line of code and the only change we need to make is that we need to specify that the plot is a bar plot and I can suppress that extra line there and you can see that the bar plot is a far more useful plot in this instance than a line plot, because it had overlapping labels in the axes.

Similarly, we can use the horizontal plot and all we need to do is kind equals barh for the horizontal plot and we get the same information as you would for a bar plot except the axes are swapped over.

Finally, we can plot using a pie chart, using exactly the same line of code with plot(kind='pie');
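Putting the above together, here is a sketch of the chained calls (the variable name firstOlympics is illustrative, not from the original notebook; run each plot line in its own cell) :-

firstOlympics = olympicsDataFrame[olympicsDataFrame.Edition == 1896].Sport.value_counts()

firstOlympics.plot();                              # line plot is the default
firstOlympics.plot(kind='line', figsize=(10, 3));  # same plot, wider figure
firstOlympics.plot(kind='bar');                    # bar plot
firstOlympics.plot(kind='barh');                   # horizontal bar plot
firstOlympics.plot(kind='pie');                    # pie chart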

Note that we wouldn’t use all of the different plots for our data. In this instance, the bar and the horizontal bar charts were the most effective in representing this data; I wouldn’t use the line or pie graphs.

Colouring scheme in Graphs :-

We can also specify the colour-coding of a graph by specifying the ‘color’ attribute inside the chained ‘plot’ command.
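For example, a small sketch (the colour value here is just an illustration, not the one used in the original screenshots) :-

firstOlympics.plot(kind='bar', color='plum');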

Colormaps

Colormaps are included with Matplotlib. The idea behind choosing a good colormap is to find a good representation of your data. Is there an intuitive color scheme for the data? For example: gold, silver, and bronze for medal winners; blue for male, pink for female. There are three classes of colormaps: sequential, diverging, and qualitative. Sequential colormaps should be used for representing information that has an ordering; there is a change in lightness, often over a single hue. Diverging colormaps are to be used when the information being plotted deviates around a middle value; here there are often two different colors being used. Finally, the qualitative class is used to represent information which does not have any ordering or relationship, and often uses miscellaneous colors.

Note that: you would have run into problems if you tried to plot a pie chart using the color parameter, as this doesn’t exist.

Basic plotting using Seaborn Library :-

Seaborn is a visualization library based on Matplotlib. One of the reasons to use Seaborn is that it produces beautiful statistical plots. It is very important to realize that Seaborn is a complement and not a substitute to Matplotlib. Now one of the advantages again with using Seaborn is that it works very well with pandas.

Seaborn has an excellent examples gallery that provides plots on the website, and if you click on any of the plots, it provides the code to generate that plot.

Seaborn, as with Matplotlib, has methods for bar plots, histograms and pie charts. Let’s take a look at an example of one of the methods, countplot. Countplot has very similar parameters to Matplotlib. The data parameter for countplot is where you provide the DataFrame or the source for the data. The hue is for the categorical variables.

A categorical variable is one that can only take a fixed number of values. Example → there is a limited number of sports, the gender is either male or female, and the medals are gold, silver or bronze.

There are two major classes of categorical-data i.e. nominal and ordinal.

  • In a nominal categorical attribute, there is no concept of ordering amongst the values of that attribute. Weather is a nominal categorical attribute. Similarly, movie, music and video game genres, country names, and food and cuisine types are other examples of nominal categorical attributes.
  • In an ordinal categorical attribute, there exists some sense or notion of order amongst the values. For instance: shirt sizes. It is quite evident that order, or in this case ‘size’, matters when thinking about shirts (S is smaller than M, which is smaller than L, and so on).

The hue allows you to specify a categorical variable in a different color. The order parameter allows you to determine the sequence of the categorical variables. And finally, the palette parameter allows you to specify colors for the different levels of the hue variable.

So when would you use Seaborn versus Matplotlib ? If you’re running a couple of scripts, then Matplotlib with pyplot is easy to use and great for bar charts, pie charts and so on. If you’re looking to write short scripts to deal with things like categorical data, or more advanced statistical plots, or creating other kinds of plots such as heatmaps, then Seaborn is going to be your choice.

Let’s import the Seaborn library. So import Seaborn as sns and run that.

Example 1 → How many medals have been won by men and women in the history of the Olympics?

Using Pandas and Matplotlib, here is how we can do this :-

Using the Seaborn library, here is how we can do the same :-
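A sketch of both approaches (the Seaborn import is repeated here so the cell is self-contained) :-

# Pandas + Matplotlib: count the medals per gender, then plot
olympicsDataFrame.Gender.value_counts().plot(kind='bar');

# Seaborn: countplot does the counting for us
import seaborn as sns
sns.countplot(x='Gender', data=olympicsDataFrame);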

Example 2→ Using the Seaborn library, we can also change the color-scheme to our desired colors using the ‘palette’ option :-

Example 3→ Now if we wanted to know the number of gold, silver and bronze medals won for each gender, this is where we can use the categorical variable. All we need to do here is to add hue equals Medal and that will allow us to present the distribution of gender over the gold, silver and bronze medals amongst the men and the women.
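A sketch of the hue usage (the second line is the Example-4 orientation, with x and hue swapped) :-

sns.countplot(x='Gender', hue='Medal', data=olympicsDataFrame);
sns.countplot(x='Medal', hue='Gender', data=olympicsDataFrame);   # the other way around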

Example 4→ Now, the other way around: say we wanted to know the number of men and women for each medal type; we can plot the number of Gold, Silver and Bronze medals for each gender like this :-

Example 5→ Plot the number of Medals achieved by the Chinese team (Men and Women) in Beijing 2008 using both MatPlotLib and SeaBorn.

  • Using MatPlotLib :-
  • Using SeaBornLib :- Note that, here we have also used the colouring scheme as well with palette option as ‘bwr’.

Indexing using Pandas :- The index object is an immutable array, and indexing allows us to access a row or a column using a label. This is part of what makes Pandas special, because typically in other programming languages we cannot access an array using labels. For example, the data-frame below is indexed by a series of integers from 0 to 29,215.

We can easily change the index of any given data-frame, as in the following example, where we have set ‘Athlete’ as the index. Pay attention to the attribute ‘inplace’ in the query below: with inplace=True, the change is applied to the existing data-frame itself rather than being returned as a new one. Now we see that it has made Athlete the index, and we can proceed to search by the athlete name.
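A sketch of that step :-

olympicsDataFrame.set_index('Athlete', inplace=True)   # modify the existing DataFrame
olympicsDataFrame.head()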

Another way of doing the same thing is by assigning the data-frame with the new index to another data-frame.

i.) Let’s see this by first resetting to the main index. This command resets the index to our original integer-based index; ‘inplace=True’ makes sure that the resetting operation is persisted :-

ii.) Now, let’s assign the data-frame with the changed index to another data-frame, and then we can query the new data-frame :-

Next, just as the default index is sorted over an integer range, we can also sort the records in a data-frame by our custom index. Here is an example below, where we show the records after sorting :-
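A sketch of steps i.), ii.) and the index sort (athleteIndexedDf is an illustrative name, not from the original notebook) :-

olympicsDataFrame.reset_index(inplace=True)                  # i.) back to the integer index

athleteIndexedDf = olympicsDataFrame.set_index('Athlete')    # ii.) assign to a new frame

athleteIndexedDf.sort_index()                                # records sorted by the custom index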

Label based Indexer :- “loc[]” is a label-based indexer; that means we are selecting by labels (i.e. values of a particular column). Let’s see an example. We searched for the label ‘Bolt, Usain’ on the data-frame (for which Athlete is the index), and we can now see a list of records where Usain Bolt has won any kind of medal in the Olympics.

As a revision: if, on the original data-frame, we had to access the records belonging to the player ‘Usain Bolt’ in the traditional way, we would do something like below :-

The same can be accessed using the “loc” indexer as well :-
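A sketch of both lookups (the exact label string must match how the name appears in the CSV) :-

olympicsDataFrame[olympicsDataFrame.Athlete == 'Bolt, Usain']   # traditional Boolean filter

athleteIndexedDf.loc['Bolt, Usain']                             # label-based lookup on the Athlete index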

Kindly also note that “loc[]” will raise a KeyError when items are not found in the data-frame. For example, say we search for a player which doesn’t exist in the index; it would throw a KeyError as follows :-

Position based Indexer :- “iloc[]” is another indexer, where we make selections by integer position. iloc is primarily integer-position based. For example, if we want to see the 2100th entry, we would do something like this :-

Next, say, If we want to see the 1100th, 2100th, 3100th, 4100th & 5100th entries all together, then we shall be doing something like this :-

Next, say we want to see the rows from index 11 to 21; then we could also use range-based (slice) indexing :-
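A sketch of the three iloc selections described above :-

olympicsDataFrame.iloc[2100]                             # a single row, by position
olympicsDataFrame.iloc[[1100, 2100, 3100, 4100, 5100]]   # a list of positions
olympicsDataFrame.iloc[11:22]                            # positions 11 to 21 (the end is excluded)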

Another full-scale-example :-

Now, we can use from our knowledge so-far that, we can use the function ‘value_counts()’ on the Edition attribute :-

But there is a problem in the above picture: the data is not sorted by the index (i.e. Edition). Please note that ‘Edition’ is the index here, and we want the data sorted on Edition. So, using our recently learned knowledge, let’s sort the index here :-

And, now we can plot the data using Matplotlib library :-

As a revision, there is another way of plotting the data using bar-graph as well :-

And, as we have learnt, there is yet another way of plotting the data, using the Seaborn library (which is quite simple as well) :-
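A sketch of the whole chain (medalsPerEdition is an illustrative variable name) :-

medalsPerEdition = olympicsDataFrame.Edition.value_counts().sort_index()

medalsPerEdition.plot();              # line plot with Matplotlib via pandas
medalsPerEdition.plot(kind='bar');    # the bar-graph alternative

sns.countplot(x='Edition', data=olympicsDataFrame);   # the Seaborn way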

Another full-scale-example :-

Let’s first find out all the records, where year is 2008 :-

Next, we apply below operations to find the result :-

GroupBy operation in Pandas :- Groupby is one of the most important functionalities available in Pandas. Groupby does three things. It splits a DataFrame into groups based on some criteria, it applies a function to each group independently and it combines the results into a DataFrame.

For example :- Let’s first apply ‘groupby’ operation using ‘Edition’ attribute. Note that, type of the ‘groupby’ would be ‘DataFrameGroupBy’. We also show the data inside this DataFrameGroupBy.

There is another, more traditional way of seeing the data inside the DataFrameGroupBy, i.e. by iterating over it :-

For better readability/visualisation, let’s limit the number of columns for each group, because each group’s value is itself another dataframe. See the query output below, where 1896 is ONE group, 1900 is another group, and so on :-
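A sketch of the grouping and of the iteration (the column selection inside the loop is just for illustration) :-

grouped = olympicsDataFrame.groupby('Edition')
type(grouped)    # DataFrameGroupBy

# iterating: each item is a (group key, sub-DataFrame) pair
for year, group in grouped:
    print(year)
    print(group[['Athlete', 'Medal']].head(2))   # limit the columns for readability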

Following are some of the interesting computational functions, that can be applied on the ‘groupby’ operator :-

Example #1 :- Let’s see the size function in Pandas, using which we can get the count of records for each year :-

Please note that in the year 2008, a total of 2,042 medals were distributed. Let’s also revisit, from our earlier learning, the traditional way of finding this count of records per ‘Year’ :-
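A sketch of both ways of counting :-

olympicsDataFrame.groupby('Edition').size()              # medals handed out per edition

olympicsDataFrame.Edition.value_counts().sort_index()    # the traditional equivalent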

Example #2 :- Further, count of records can be found in multiple groups as well :-

Please note that in the year 1896, a total of TWO medals were distributed to the country AUS, and a total of 315 medals were won by the USA in 2008. This very same requirement can be accomplished using an aggregate function as well. Let’s learn about the ‘agg()’ function :-
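A sketch of the multi-column grouping and of the agg() alternative :-

olympicsDataFrame.groupby(['Edition', 'NOC']).size()                   # medals per edition and country

olympicsDataFrame.groupby(['Edition', 'NOC']).agg({'Medal': 'count'})  # same figures via an aggregate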

Example #3 :- Let’s find the number of medals won by the player ‘LEWIS Carl’, grouped by Edition :-
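A sketch of this query (the exact athlete string is an assumption; adjust it to however the name is stored in your CSV) :-

lewis = olympicsDataFrame[olympicsDataFrame.Athlete == 'LEWIS, Carl']   # match the spelling in the data
lewis.groupby('Edition').size()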

Example #4 :- Let’s find the number of medals won by the player ‘LEWIS Carl’, grouped by Edition :-

Example #5 :- Let’s take this full example head-on :-

Example #6 :- Let’s take another example :-

We perform the ‘groupby’ operation on the attribute ‘NOC’ and then aggregate upon the attribute ‘Edition’ (i.e. ‘Year’). We also print the min, max and count values :-
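A sketch of that aggregation :-

olympicsDataFrame.groupby('NOC').agg({'Edition': ['min', 'max', 'count']})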

Example #7 :- Let’s take another example :- In the Beijing 2008 Olympics, for the 100m & 200m track-and-field events, find the number of medals received by each athlete and the country they represented.

Example #8 :- Plot the number of Gold Medals won by US male & female athletes throughout the history of the Olympics. First, we perform the filtering :-

Approach #1 :- Next, we find the count of medals won by Male & Females separately using value_counts() :-

Approach #2 :- We can also find the count of medals won by Male & Females using groupby() operation :-
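A sketch of the filtering plus the two approaches (usGold is an illustrative name) :-

usGold = olympicsDataFrame[(olympicsDataFrame.NOC == 'USA') & (olympicsDataFrame.Medal == 'Gold')]

usGold.Gender.value_counts().plot(kind='bar');     # Approach #1
usGold.groupby('Gender').size().plot(kind='bar');  # Approach #2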

HeatMaps :- Seaborn’s heatmap provides very intuitive plots.

Example #1 :- Let’s say we wanted to present a summary of the total medals won by participating countries in the 2008 Beijing Olympics. We will show how this can be achieved using heatmaps :-

First, we filter out only those records from the dataframe, which are for Year 2008.

Next, we perform the ‘groupby’ operation on the aforesaid filtered dataframe :-

Next, we shall beautify the above data by performing the ‘unstack’ operation on it. This spreads the medal types into separate columns for each country :-

Next, we do some data pre-processing by filling the ‘NA’ values with ZERO. Also, below, we have arranged the ‘Gold’ medal column first, the ‘Silver’ medal column second and the ‘Bronze’ medal column third.

Next, let’s sort the data in decreasing order of the number of medals won by each country for each medal-type :-

Next, lets present this data using heatmaps :-

Next, in order to make the data more readable, let’s transpose the data axes. We now have ‘NOC’ on the X-axis and ‘Medal-Type’ on the Y-axis.

Next, let’s plot the aforesaid data again on the heatmap. In the representation below, it’s evident that the USA has won more medals in the Gold & Silver categories than in the Bronze category. Similarly, AUS has won more medals in the Bronze category than in the Gold or Silver categories.
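Putting the whole Example #1 pipeline together, a sketch (variable names are illustrative, and the Gold/Silver/Bronze column labels are assumed to match the Medal values in the CSV) :-

beijing2008 = olympicsDataFrame[olympicsDataFrame.Edition == 2008]

medalTable = beijing2008.groupby(['NOC', 'Medal']).size().unstack('Medal', fill_value=0)
medalTable = medalTable[['Gold', 'Silver', 'Bronze']]                      # Gold first, then Silver, then Bronze
medalTable = medalTable.sort_values(['Gold', 'Silver', 'Bronze'], ascending=False)

sns.heatmap(medalTable);               # countries on the Y-axis, medal types on the X-axis
sns.heatmap(medalTable.transpose());   # transposed: NOC on the X-axis, medal type on the Y-axis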

Example #2 :- Plot the 5 athletes who have won the most gold medals over the history of the Olympics. In case of a tie, consider the silver medals and then the bronze medals.

First, we perform ‘groupby’ operation upon attribute: ‘Athlete’ and ‘Medal’ and then find size() operation on it.

Next, we perform ‘unstack’ operation upon ‘Medal’ attribute to spread it out, Medal wise and fill value as ZERO for NA.

Next, we sort the records by value of Medals in ‘Gold’ category in descending order. We also want to display the category Gold column first, Silver column second and then Bronze column in last.

Next, lets plot some of the records in form of a bar-graph :-

Next, say we wanted to paint custom colors in the graph as per our needs; then we could create our own coloringScheme as well :-
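A sketch of the Example #2 pipeline, including an illustrative custom colour list for the three medal columns :-

athleteMedals = olympicsDataFrame.groupby(['Athlete', 'Medal']).size().unstack('Medal', fill_value=0)
athleteMedals = athleteMedals[['Gold', 'Silver', 'Bronze']]

top5 = athleteMedals.sort_values(['Gold', 'Silver', 'Bronze'], ascending=False).head(5)

top5.plot(kind='bar', color=['gold', 'silver', '#cd7f32']);   # one colour per medal column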

Example #3 :- In every Olympics, which US athlete has won the highest total number of medals? Include the athlete’s discipline as well.

First, we filter the records by country as USA and then perform a groupby upon attributes : ‘Edition’, ‘Athlete’ & ‘Medal‘.

Second, we unstack the dataframe with ‘Medal’ attribute :-

Third, we create a new column in this dataframe by summing up all the medal counts, i.e. (Gold + Silver + Bronze), from the dataframe itself :-

Next, we loop through this groupBy dataframe in order to print the maximum value of attribute ‘Total’ :-
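A sketch of the Example #3 steps (usMedals is an illustrative name; the Discipline/Sport column can be added to the groupby if required) :-

usMedals = olympicsDataFrame[olympicsDataFrame.NOC == 'USA']
usMedals = usMedals.groupby(['Edition', 'Athlete', 'Medal']).size().unstack('Medal', fill_value=0)

usMedals['Total'] = usMedals['Gold'] + usMedals['Silver'] + usMedals['Bronze']   # new Total column

# for every edition, print the athlete(s) with the highest Total
for year, group in usMedals.groupby(level='Edition'):
    print(group[group.Total == group.Total.max()])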

With this, we come to the end of this blog. Thanks for making it till here. I hope this blog has given you a fair understanding of the concepts of Python Pandas.


