Skip to main content

Data Exploration

Data exploration is the process of examining and understanding your dataset's structure and content. This guide will walk you through a typical dataset exploration workflow using qsv commands in your terminal.

For the following examples, we'll use a sample dataset from the Dubai Land Department (DLD) containing real estate transaction information. You can access the dataset here:

Best Practices

  1. Be consistent: Use the same command structure and naming conventions throughout your exploration.
  2. Document your process: Keep track of the commands you run and their purposes.
  3. Iterative exploration: Revisit earlier steps as you gain new insights into your dataset.
  4. Version control: Save different versions of your cleaned and transformed data.
  5. Automate when possible: Create shell scripts for repetitive tasks.

Data Exploration Workflow with qsv

1. Initial Data Loading and Inspection

a. Preview the data as a table:

Preview data as table
qsv sample 10 'DLD_Transactions_English_500.csv'  \
| qsv table

This command randomly selects 10 rows from the CSV file, giving you a quick overview of the data structure and content.

b. Preview the data in a flattened format:

Preview data in flattened format
qsv sample 5 'DLD_Transactions_English_500.csv'  \
| qsv flatten

This command shows the first 10 rows of the dataset in a flattened format, this is useful to see the data in a more readable format, especially when there are a lot of columns.

c. Check dimensions:

Check dimensions
qsv count 'DLD_Transactions_English_500.csv'
qsv headers 'DLD_Transactions_English_500.csv' \
| qsv table

The first command counts the number of rows in the file. The second command displays the header row, showing all column names.

d. View the first few rows:

View first few rows
qsv head -n 5 'DLD_Transactions_English_500.csv'  \
| qsv table

This displays the first 5 rows of the dataset.

e. View the last few rows:

View last few rows
qsv tail -n 5 'DLD_Transactions_English_500.csv'  \
| qsv table

This displays the last 5 rows of the dataset.

2. Understand Data Structure

a. Examine data types and summarize the data:

Examine data types and summarize
qsv stats 'DLD_Transactions_English_500.csv'  \
| qsv table

This command provides summary statistics for each column and displays them in a table.

b. Identify missing values:

Identify missing values
qsv stats 'DLD_Transactions_English_500.csv'  \
| qsv select field,nullcount \
| qsv table

This command chain shows the number of null values in each column in a tabular format.

c. Check for duplicate rows:

Check for duplicate rows
qsv dedup --count 'DLD_Transactions_English_500.csv'  \
| qsv table

This command identifies and counts duplicate rows in the dataset.

d. Examine a specific column's data:

Examine a specific column's data
qsv select area_name_en 'DLD_Transactions_English_500.csv'  \
| qsv table

This command selects and displays only the 'area_name_en' column.

e. Get unique values in a column:

Get unique values in a column
qsv unique area_name_en 'DLD_Transactions_English_500.csv'  \
| qsv table

This command displays the unique values in the 'area_name_en' column.

f. Get the top N values in a column:

Get top N values in a column
qsv top -n 5 area_name_en 'DLD_Transactions_English_500.csv'  \
| qsv table

This command displays the top 5 most frequent values in the 'area_name_en' column.

3. Data Cleaning

a. Handle missing values (example: remove rows with missing values):

Handle missing values
qsv search -s transaction_id [^[:space:]] 'DLD_Transactions_English_500.csv' > 'cleaned_transactions.csv'
qsv sample 10 'cleaned_transactions.csv' \
| qsv table

This removes rows with missing transaction_ids and displays a sample of the result.

b. Standardize data (example: convert dates to a consistent format):

Standardize data
qsv select transaction_id,instance_date 'cleaned_transactions.csv'  \
| qsv datefmt instance_date %Y-%m-%d \
| qsv table

This formats the instance_date column to "YYYY-MM-DD" and displays the result.

c. Remove duplicate rows:

Remove duplicate rows
qsv dedup 'cleaned_transactions.csv' > 'unique_transactions.csv'
qsv count 'unique_transactions.csv'

This removes duplicate rows and shows the count of unique transactions.

d. Fill missing values with a default:

Fill missing values with a default
qsv fill -c property_type_en Unknown 'cleaned_transactions.csv'  \
| qsv table

This fills any missing values in the 'property_type_en' column with 'Unknown'.

e. Replace specific values:

Replace specific values
qsv replace -c area_name_en Dubai Dubai City 'cleaned_transactions.csv'  \
| qsv table

This replaces the value "Dubai" with "Dubai City" in the 'area_name_en' column.

f. Remove rows with specific values:

Remove rows with specific values
qsv search -v -c area_name_en Dubai City 'cleaned_transactions.csv' > 'filtered_transactions.csv'
qsv count 'filtered_transactions.csv'

This removes rows where the 'area_name_en' column contains "Dubai City" and shows the count of the resulting dataset.

4. Exploratory Data Analysis (EDA)

a. Examine unique values in a column:

Examine unique values in a column
qsv frequency 'cleaned_transactions.csv' area_name_en  \
| qsv table

This counts the frequency of each unique value in the area_name_en column and displays it in a table.

b. Basic statistical analysis:

Basic statistical analysis
qsv stats 'cleaned_transactions.csv' meter_sale_price  \
| qsv table

This provides summary statistics for the meter_sale_price column in a tabular format.

5. Data Transformation

a. Feature engineering (extract year from date):

Feature engineering (extract year from date)
qsv select transaction_id,instance_date 'cleaned_transactions.csv'  \
| qsv datefmt instance_date %Y \
| qsv table

This extracts just the year from the instance_date column and displays the result.

b. Data aggregation (example: count transactions by year):

Data aggregation (count transactions by year)
qsv select transaction_id,instance_date 'cleaned_transactions.csv'  \
| qsv datefmt instance_date %Y \
| qsv frequency instance_date \
| qsv table

This counts the number of transactions per year and displays the result in a table.

6. Documentation and Reporting

Generate a summary report:

Generate summary report
qsv stats 'cleaned_transactions.csv'  \
| qsv table > 'summary_report.txt'

This generates summary statistics for all columns and saves them to a text file in a tabular format.

7. Prepare Data for Modeling

Save processed data:

Save processed data
qsv select transaction_id,area_name_en,meter_sale_price,instance_date cleaned_transactions.csv > prepared_data.csv \
\
| qsv sample 10 prepared_data.csv \
\
| qsv table

This selects specific columns for modeling and displays a sample of the result.

This workflow demonstrates how to use qsv commands in your terminal to explore and prepare your data. Remember that data exploration is often an iterative process, and you may need to revisit earlier steps as you gain new insights into your dataset.

Note: Some advanced operations like visualization and hypothesis testing are not directly available in qsv. For these tasks, you might need to use the output from qsv with other tools or programming languages.