Simple Excel formulas to determine patterns in your data

Ashton Welcome
8 min readDec 20, 2018
https://tinyurl.com/yd8g7qmv

Excel is an extremely powerful research tool because it has the ability to work through thousands of rows and columns of information in a matter of seconds. I have only scratched the surface of its capabilities but I would still like to share my shavings of knowledge in case it can help you with your own data analyses.

I used Excel in my research of the food plants of southern Africa. I did a literature study, using 74 references, which resulted in a list of 1740 edible plants. This is a huge data set on all of the edible plants which included information such as:

  1. The scientific name of each plant (This is a two-part name known as a binomial name)
  2. The plant family of each plant (more than one edible plant can belong to a plant family)
  3. The parts of the plant that are used as food and the category of food use
  4. The references (simplified into letters with numbers)

This is how the excel spreadsheet looks

I wanted to determine a number of patterns using this data and I ended up using only the “Remove duplicates” and “Text to columns” functions and the COUNTIF and SUM formulas.

The “Remove duplicates” function is under the data tab. This function identifies unique words in a specific column. When each unique word is found more than once, the rows which contain the duplicate is removed. This leaves you with a column only containing the unique words.

The “Text to columns” function is also under the data tab. This function allows you to split the information from one column into many columns based on either a space, a semicolon or even the number of letters.

The COUNTIF statement looks like this “=COUNTIF(range,criteria)”. In newer versions of Excel, the comma is replaced with a semicolon. This formula allows you to count how many times a certain word (the criteria) is found in a specific column (the range). So it will end up looking like this “=COUNTIF(A:A,B1)”.

So what patterns did I want to determine?

Firstly, I wanted to know how many edible plants are in each plant family? To answer this question, I used the Remove duplicates function and the COUNTIF statement.

I pasted the family column into a new sheet and selected the entire sheet.

On the data tab, I selected “Remove duplicates”.

This is the pop up that appears after selecting “Remove duplicates”.

Now I had a column of only unique families.

I pasted this column back into my original sheet and then counted how many times each family name appeared in the original family column using the COUNTIF formula. The calculation can be dragged down so that the criteria can adjust according to each row ($ signs will fix the criteria to one cell which is not required in this case). The range does not adjust if you have selected a whole column as your range (such as A:A).

I can then do a custom sort so that the data is arranged from largest to smallest, to see which family has the most edible plants.

The next question to answer is, how many edible plants are in each genus?

As I mentioned earlier, every plant has a binomial name which is made up of a genus name and specific epithet. The full name (including the author of the name) is in the column. I only need the genus name to determine this result.

All I need to do is select the column with the plant names and go to Data and select “Text to columns”. (Before using the “Text to columns” function, copy and paste the column that you would like to split into the last empty column of your worksheet. This will allow it to split the data into other empty columns.)

Select the delimited option.

Choose a space as the delimiter and click finish.

Your column will now be split into many columns so that you can have the information you need on its own.

I then followed the same steps I did for the families by pasting the column containing only the genus names into a new sheet and deleting the duplicates. Once the duplicates have been removed, the column containing only the unique genus names can then be taken back to the original sheet to use the COUNTIF formula.

The next question is, how many edible plants have edible roots, leaves or fruits and how many edible plants are used as raw snacks or cooked vegetables? This is where it gets a little more interesting because there is too much data to do simple filtering and counting. Each edible plant has a plant part that is used as food. In some cases, one plant has more than one part being used or one part has more than one use.

What I did here was to paste the column into a word document and remove all spaces and replace all colons and commas with semicolons.

I took that column back to excel and used the “Text to columns” function by letting it identify breaks based on semicolons (instead of a space as used above).

Because some species had more information than others, they could have used more columns to split the data into. Find and indicate the last column that contains information. This will be the range which you will use in the next step. In this case, the range is Q2:AN2.

Now that we have our range, we need the criteria. I made headings at the top of the sheet with all of the plant part and plant use options (the criteria). Next to this was the data that was split in the previous step.

For the COUNTIF formula, the criteria needs to stay fixed (unlike above when we needed it to change as it was dragged down the rows). Fix the criteria to one cell using $ signs. The following COUNTIF formula was used “=COUNTIF(Q2:AN2,$K$1)” This searched for “und” (underground storage organs) in the range where the information was split. The formula had to be changed for each plant part and plant use and then it could be dragged down the rows so that the range could alter for each row.

You can add a row after your headings and SUM the column to find out how many edible plants have edible underground storage organs or stems or leaves or fruits and how many are snacks or vegetables. It is important to remember though that in some cases one plant might have more than one plant part used as a vegetable for instance. So each of the columns need to be filtered so that if there is any number higher than 1 (which implies that the word was counted more than once for a specific plant) it will need to be changed to 1.

This data could also be filtered (and pasted into a new sheet) according to families to determine what the plant parts and use patterns are for individual families.

The last question to answer is, which edible plants have been reported in literature many times and which literature sources contributed the most information on edible plants? I followed the same steps as above with the plant parts and uses. Simply paste the column into word and remove all spaces and replace commas with semicolons. Note: If you use the same style as I did for your references, the letter R will need to be replaced with something else (maybe a double A) so that Excel can read it and count it. Apparently the same goes for C, but I never had problems with it.

Split the data based on semicolons. Make headings so that each reference appears as a heading, this is your criteria which will need to be fixed with $’s. Use the COUNTIF formula for the horizontal range. You will not need to search for any values higher than 1 because none of the plants will have references repeated. In the same way as above, you will be able to count how many plants have a specific reference. You can also do the SUM calculation next to each plant to count how many references it is found in. You can then see which references contributed the most edible plants and which edible plants have the highest number of citations.

That is how I worked out my results! Hope this helps you. Check out my article to see the final results.

I had some other interesting results to determine using different data. This required me to use the VLOOKUP function and a regression analysis which I will write about in my next article. Keep a look out.

Thank you to Dustin Botes for teaching me so many useful Excel formulas, and Peter-John Welcome for reading this article and always making sure that my computer was powerful enough to calculate what I need it to.

--

--