How to Analyze Data

After your team and data analyst have set your objectives and gathered data, you need to analyze that data to meet those objectives. When analyzing data, you can use descriptive, visual, inferential, or modeling techniques. In this article we discuss various data analysis techniques and the tools you can use to apply them.

Summarizing Data Using Descriptive Statistics

Descriptive statistics help you summarize and understand your data. The techniques you use depend on whether your data is categorical or continuous. Categorical data refers to observations that fall into distinct categories, for example male or female. Continuous data refers to observations measured on a numeric scale that can take any value within a range, such as weight.

When your data is categorical, the most useful descriptive technique is a count. You count the number of observations that occur in each category. For example, when you have one variable such as gender, you count the number of people who are male and the number who are female. When you would like to know the number of people in each category as a proportion of the total, you use a percentage. In the gender example, you can calculate the percentage of people who are male and the percentage who are female.
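The counting and percentage steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `summarize_categories` and the sample list `genders` are made up for the example.

```python
from collections import Counter

def summarize_categories(values):
    """Return {category: (count, percentage of total)} for categorical values."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, 100 * n / total) for category, n in counts.items()}

# Hypothetical sample: gender recorded for ten people.
genders = ["male", "female", "female", "male", "female",
           "female", "male", "female", "female", "female"]

summary = summarize_categories(genders)
for category, (count, percent) in summary.items():
    print(f"{category}: {count} ({percent:.0f}%)")
```

With this sample, three of the ten people are male and seven are female, so the percentages are 30% and 70%.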

As you summarize categorical data, you are not limited to one variable. To summarize two categorical variables, we use a cross tabulation. In a cross tabulation, one variable forms the rows and the other variable forms the columns. We then count the number of observations that fall in each cell, that is, in each combination of a row category and a column category. If in our example we also have an education variable, we would be interested in knowing the education levels of males and females. The education variable could have the categories no education, primary, secondary, college, and university.
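A cross tabulation like the one described above could be built as follows. This is a sketch under stated assumptions: the helper `cross_tabulate` and the two small sample lists are hypothetical and only stand in for real gender and education variables.

```python
from collections import Counter

def cross_tabulate(row_values, col_values):
    """Count observations for every (row category, column category) pair."""
    return Counter(zip(row_values, col_values))

# Hypothetical sample: one gender and one education level per person.
gender = ["male", "female", "female", "male", "female", "male"]
education = ["primary", "college", "primary", "secondary", "college", "primary"]

table = cross_tabulate(gender, education)

# Print the table with gender as rows and education as columns.
row_labels = sorted(set(gender))
col_labels = sorted(set(education))
print("gender".ljust(10) + "".join(c.ljust(11) for c in col_labels))
for r in row_labels:
    # A Counter returns 0 for combinations that never occur.
    print(r.ljust(10) + "".join(str(table[(r, c)]).ljust(11) for c in col_labels))
```

Each cell holds the number of people with that combination of gender and education level, and the cell counts add up to the total number of observations.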

For continuous variables, there are descriptive measures that tell us how our observations cluster around a single value and measures that tell us how our observations are spread. The mean and the median are two common measures of the central value. The mean is an appropriate measure when our observations fall roughly evenly on either side of the center. The median is an appropriate summary when most observations fall on one side, that is, when our observations are skewed.

If we collect observations on the weight of adult patients, we can use the mean to get the typical weight of a patient. If we collect observations on salaries, we will often have a few people earning much more than the others. In that case, the median would be a better summary.
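The difference between the two measures is easy to see with a small example. The numbers below are invented for illustration: a roughly symmetric set of weights and a set of salaries with one very high earner.

```python
from statistics import mean, median

# Hypothetical data: weights are roughly symmetric, salaries are skewed.
weights_kg = [62, 70, 68, 75, 71, 66, 73]
salaries = [30_000, 32_000, 35_000, 38_000, 40_000, 250_000]

print(f"weights:  mean={mean(weights_kg):.1f}  median={median(weights_kg)}")
print(f"salaries: mean={mean(salaries):.0f}  median={median(salaries):.0f}")
```

For the weights, the mean and the median are close to each other. For the salaries, the single large value pulls the mean far above what most people earn, while the median stays near the middle of the group.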

The minimum, the maximum, the range, the variance, and the standard deviation tell us how observations are spread. The minimum tells us the lowest observation, the maximum tells us the highest observation, and the range gives us the difference between the lowest and the highest observation in our data. The variance and the standard deviation tell us how much the observations vary around the mean.
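As a minimal sketch, these measures of spread can be computed with Python's standard library. The helper name `describe_spread` and the sample weights are assumptions for the example, and the variance and standard deviation here are the sample versions, which divide by n - 1.

```python
from statistics import stdev, variance

def describe_spread(values):
    """Return the common measures of spread for a list of numbers."""
    lowest, highest = min(values), max(values)
    return {
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
        "variance": variance(values),  # sample variance (divides by n - 1)
        "stdev": stdev(values),        # square root of the sample variance
    }

# Hypothetical sample of adult weights in kilograms.
weights_kg = [62, 70, 68, 75, 71, 66, 73]
spread = describe_spread(weights_kg)
for name, value in spread.items():
    print(f"{name}: {value:.2f}")
```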

The confidence interval is calculated from the mean, the standard deviation, and the number of observations. It gives us lower and upper bounds within which the true mean is likely to lie.

When you have two continuous variables, a correlation coefficient helps you understand the strength and direction of their relationship. A negative coefficient shows that when one variable increases, the other tends to decrease. A positive coefficient shows that when one variable increases, the other tends to increase as well. A value close to zero shows a weak or no linear relationship. A value around 0.5 (or -0.5) shows moderate strength, while a value close to 1 (or -1) shows a strong relationship.
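Both the confidence interval and the correlation coefficient can be sketched with the standard library. Several things here are assumptions for illustration: the function names, the sample heights and weights, the use of the Pearson coefficient, and the multiplier z = 1.96, which is a normal approximation for a 95% interval. For small samples, an analyst would normally use a value from the t distribution instead.

```python
from math import sqrt
from statistics import mean, stdev

def mean_confidence_interval(values, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    center = mean(values)
    margin = z * stdev(values) / sqrt(len(values))  # z times the standard error
    return center - margin, center + margin

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

# Hypothetical sample: height and weight for five people.
heights_cm = [150, 160, 165, 170, 180]
weights_kg = [52, 58, 63, 70, 79]

low, high = mean_confidence_interval(weights_kg)
r = pearson_correlation(heights_cm, weights_kg)
print(f"mean weight is likely between {low:.1f} and {high:.1f} kg")
print(f"correlation between height and weight: {r:.2f}")
```

In this sample, taller people are also heavier, so the coefficient is positive and close to 1.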

Visualizing Data With Graphs

There are different tools for visualizing categorical and continuous data. To visualize categorical data, you use a pie chart or a bar chart. A pie chart divides a circle into slices that let you see the count or percentage of observations in each category. A pie chart can only be used to visualize one categorical variable. A bar chart helps you visualize categorical data using vertical or horizontal bars that show you the count or percentage of observations in each category.

You can add the count or percentage of each category on the bars for easy comparison. Bars that are taller (or, in a horizontal chart, longer) than the others show more observations in those categories. A bar chart can be used to summarize one or two categorical variables.
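To show the idea behind a bar chart without depending on a plotting tool, the sketch below draws horizontal bars as plain text. It is only an illustration: the function `text_bar_chart`, the `#` bars, and the sample data are invented, and in practice a charting tool would draw the bars for you.

```python
from collections import Counter

def text_bar_chart(values, width=20):
    """Return one text line per category: label, bar, count, and percentage."""
    counts = Counter(values)
    total = len(values)
    longest = max(counts.values())
    lines = []
    for category, n in counts.most_common():
        # Scale each bar so the largest category fills the full width.
        bar = "#" * round(width * n / longest)
        lines.append(f"{category:<10} {bar} {n} ({100 * n / total:.0f}%)")
    return lines

# Hypothetical education levels for ten people.
education = ["primary"] * 5 + ["secondary"] * 3 + ["college"] * 2
chart = text_bar_chart(education)
print("\n".join(chart))
```

The longest bar belongs to the category with the most observations, and the label at the end of each bar carries the count and percentage for easy comparison.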

To visualize continuous observations, you can use a histogram, a box plot, a scatter plot, or a line plot. A histogram uses bars similar to a bar chart to visualize continuous observations. The key difference is that each bar in a bar chart represents a single category, while each bar in a histogram represents a range of values. A box plot summarizes data using a box and whiskers. In a common convention, the whiskers on both ends of the box extend to the lowest and highest observations that are not outliers. Observations that lie beyond the whiskers are plotted individually as outliers.

The box shows you where the middle half of your observations lie, and within the box there is a line that shows you where the median lies. The histogram and the box plot are useful for visualizing the distribution of your observations. The scatter plot helps you visualize the relationship between two continuous variables. It lets you see the direction and strength that a correlation coefficient expresses numerically.
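The numbers behind a histogram and a box plot can be computed directly, which may help in reading the plots. The sketch below makes several assumptions: the function names and sample weights are invented, the bin width of 10 is arbitrary, and outliers are defined with the widely used rule of 1.5 times the interquartile range beyond the box. Plotting tools may use slightly different quartile or outlier conventions.

```python
from statistics import quantiles

def histogram_counts(values, bin_width):
    """Count observations per range [start, start + bin_width); empty bins are omitted."""
    counts = {}
    for v in values:
        start = v // bin_width * bin_width  # lower edge of the bin holding v
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

# Hypothetical weights in kilograms, with one unusually high value.
weights_kg = [52, 58, 61, 63, 65, 66, 68, 70, 72, 75, 118]
bins = histogram_counts(weights_kg, bin_width=10)
box = box_plot_summary(weights_kg)
print(bins)
print(box)
```

Here the box runs from the first to the third quartile, the whiskers stop at 52 and 75, and the value 118 falls beyond the upper whisker, so it is reported as an outlier.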

Making Inferences From Data

The techniques we have discussed so far help you summarize your data. To test hypotheses about your data you use inferential techniques. There are different techniques for continuous and categorical variables.

A Chi-square test helps you test whether there is a relationship between categorical variables. For example, returning to the gender and education example from the section on summarizing categorical data, we can use a Chi-square test to check whether the education levels of men and women differ. For continuous variables we are mostly interested in the mean, where we can use T tests or analysis of variance (ANOVA).
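To make the Chi-square test concrete before moving on to means, the sketch below computes the test statistic and its degrees of freedom from a cross tabulation. The function name and the counts are hypothetical. The code stops at the statistic: deciding significance requires comparing it with the Chi-square distribution, which statistical tools do for you.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Hypothetical counts. Rows: male, female. Columns: primary, secondary, college.
observed = [
    [20, 30, 10],
    [10, 30, 20],
]
statistic, df = chi_square_statistic(observed)
print(f"chi-square = {statistic:.2f} with {df} degrees of freedom")
```

A larger statistic means the observed counts are further from what we would expect if gender and education were unrelated. For 2 degrees of freedom, the usual 5% critical value is about 5.99, so a statistic above that would suggest a relationship in this made-up sample.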

There are three variants of the T test. They help us test whether the mean of one variable differs from a target mean, whether the means of two groups differ, and whether the mean of one group differs between two time points. ANOVA extends T tests by helping us test whether the means of more than two groups differ.
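The three T test variants described above differ mainly in how the test statistic is formed. The sketch below computes only the statistic for each variant. The function names and data are hypothetical, and the two-group version uses Welch's form, which does not assume equal variances. That is a choice made for this example, since the article does not say which form is meant. Turning a statistic into a p-value requires the t distribution, which statistical tools provide.

```python
from math import sqrt
from statistics import mean, stdev, variance

def one_sample_t(values, target_mean):
    """t statistic for comparing a sample mean with a target mean."""
    standard_error = stdev(values) / sqrt(len(values))
    return (mean(values) - target_mean) / standard_error

def two_sample_t(group_a, group_b):
    """Welch's t statistic for comparing the means of two independent groups."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

def paired_t(before, after):
    """t statistic for the same subjects measured at two time points."""
    differences = [b - a for a, b in zip(before, after)]
    return one_sample_t(differences, 0)

# Hypothetical weights (kg) for the same three patients at two visits.
visit_1 = [70, 80, 90]
visit_2 = [72, 83, 95]
print(f"one sample vs 75 kg: t = {one_sample_t(visit_1, 75):.2f}")
print(f"two groups:          t = {two_sample_t(visit_1, visit_2):.2f}")
print(f"paired:              t = {paired_t(visit_1, visit_2):.2f}")
```

Note that the paired version works on the per-patient differences, so it can detect a consistent change even when the two sets of measurements overlap heavily.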

To support the process of data analysis, your data analysts will use the commercial and open source tools that have been developed for this purpose. Popular commercial data analysis tools include IBM SPSS, SAS, Stata, Excel, and Minitab. These tools provide a graphical user interface and a programming language for data analysis. R is a popular open source tool that is used to analyze data by writing programs. All of the tools we have mentioned support the data analysis techniques we have discussed.