Data Cleaning and Exploratory Process

Tableau Software

Tableau is a powerful data visualization and business intelligence software that helps individuals and organizations transform raw data into insights. Known for its intuitive drag-and-drop interface, Tableau enables users to create dynamic dashboards, interactive visualizations, and detailed reports without requiring extensive technical expertise. Tableau supports data integration from multiple sources, including spreadsheets, databases, and cloud services, making it versatile for various industries and use cases.

The Tableau Data Builder, part of Tableau’s ecosystem, is a tool designed to streamline the process of preparing and shaping data for analysis. It allows users to combine, clean, and organize datasets with ease, ensuring data quality and readiness before visualization. With features like automated data modeling, real-time collaboration, and support for complex transformations, Tableau Data Builder empowers users to focus more on analyzing and interpreting data rather than spending time on preprocessing tasks.

In this tutorial, you will explore both Tableau software and Tableau Data Builder. Let’s consider the case of a data analyst working for a city’s police department. The department has recently decided to modernize its crime database to enhance the accuracy of its crime mapping system. The analyst is tasked with cleaning and restructuring the existing crime data to make it more reliable and actionable for visualization and decision-making.

Preparing the Dataset for Analysis

In this tutorial, we will work with a dataset containing crime records for Mexico City. This dataset includes information such as the type of crime, the date and time it occurred, and spatial data like latitude and longitude. However, like many real-world datasets, it contains inconsistencies and missing values that need to be addressed before analysis. Using Tableau Data Builder, we will clean and prepare the dataset while ensuring the original file remains untouched. This step is critical for maintaining data integrity and ensuring accurate and actionable insights during visualization.

To prepare the dataset for analysis and meet the specific criteria required for visualization, we will address the following:

  • Eliminate records without spatial location using Tableau Data Builder. Records without spatial information (latitude and longitude) lack the ability to be mapped effectively.
  • Define a relevant time frame for the analysis and exclude records outside this range. This ensures the data is consistent and focused on the period of interest.
  • Filter out entries that do not include a timestamp or date of when the crime occurred. This ensures the dataset can support time-based visualizations and analysis.
  • Identify and correct erroneous values in the longitude field to maintain the integrity of spatial data. Tableau Data Builder allows for easy detection and rectification of such anomalies.
  • Extract all records related to robbery in any of its forms. This subset can then be used for specific visualizations or in-depth analysis focusing on robbery trends in Mexico City.

Once these steps are completed, the prepared dataset will be ready for use in Tableau, where you can create visualizations to uncover trends and insights in crime data. By using Tableau Data Builder for this process, you ensure the original dataset remains untouched while maintaining a streamlined and clean data preparation workflow.

The following steps will be followed:

  • Connecting the file in Tableau – exploring the dataset!
  • Create a Clean Step
    • Handle Null values and correct false “positive values” in coordinates
    • Verify the correct dates
  • Add an Output step and name the file and the location, and the output format
  • Run the model

1. Connecting the File in Tableau

Launch Tableau, go to the “Connect” pane, and select the appropriate file type (e.g., CSV, Excel). Browse to the file location, open it, and preview the data to ensure it loads correctly. Familiarize yourself with the dataset structure and content.

Connect to file – in this case, we are adding a text file (*.csv)

  • Next, search the file \Data\Data_CdMx_Crime
  • Open the file named ‘victimas_completa_octubre_2021.csv’
  • Confirm that the input data parameters (e.g., file format, column names, and data types) match the expected structure of your dataset.
  • Double-check that you are using a copy of the original file to preserve the integrity of the raw dataset.

2. Clean Step

After connecting to the dataset, add a “Clean Step” to start preparing the data. This will allow you to apply transformations without altering the original file.

Review the summary graphs provided in Tableau Prep Builder for each attribute. These visualizations display the distribution of the data and help identify patterns, outliers, or inconsistencies. Also, look for any extreme values, unusual patterns, or inconsistencies in the dataset. Pay particular attention to numeric fields, spatial data, and date ranges.

Next, we will create filters to remove outliers, change the data type, null values, or any records that do not meet the criteria for your analysis. For example, filter out records with missing spatial data, incorrect longitude/latitude values, or invalid date entries.

Check for Null values in coordinates in the latitude and longitude fields. Null values indicate missing spatial data, which should be filtered out to ensure accuracy in mapping.

Next, we will identify and correct the ‘false positive’ longitude values. In Mexico City, longitude values should always be negative. However, some entries might have false positive values (e.g., incorrectly recorded as positive). To fix this:

  • Add a calculated field by clicking on the dropdown arrow next to the longitude field and selecting “Create Calculated Field.”
  • Name the calculated field “Longitud_Fix” for clarity.
  • Enter the following formula into the calculated field editor: (ABS([longitud])) * -1
  • This formula takes the absolute value of the longitude (ensuring it is positive) and multiplies it by -1 to make it negative, as it should be for Mexico City.
  • If you no longer need the original longitude field, replace it with “Longitud_Fix” in your dataset to standardize your data.

These steps ensure that all longitude values are accurate and consistent with the spatial requirements for Mexico City, improving the reliability of your mapping visualizations.

Review all fields in the dataset to ensure they are complete and accurate. Pay close attention to the ‘dates’ field to verify that all entries are valid and correctly formatted. Also, Identify and delete all records with null values in the following fields:

  • Latitude
  • Longitude
  • Fecha Hecho (Date of the incident)

Once the data cleaning is complete, save your Tableau Prep flow to ensure all changes are retained for future reference. When saving the output files, make sure to store them within the designated project’s folder. This will help maintain organization and ensure all project-related files are easily accessible.

3. Creating an Output

In the Tableau Prep Builder interface, locate the Output option in the toolbar or right-click on the last step in your flow and select Add Output.

In the output configuration panel, you will need to specify the following:

  • File Name: Enter a descriptive name for the output file (e.g., “Cleaned_Crime_Data_Oct_2021”).
  • File Format: Choose the desired output format (e.g., CSV, Excel, or Tableau Hyper Extract). Select the format based on how you intend to use the dataset.
  • File Location: Browse to your project’s folder and set it as the destination for saving the output file. Ensuring the file is saved in the correct location helps maintain project organization.
  • Run the model!