Data Exploration
Data exploration is the process of examining and understanding your dataset's structure and content. This guide will walk you through a typical dataset exploration workflow using qsv commands in your terminal.
For the following examples, we'll use a sample dataset from the Dubai Land Department (DLD) containing real estate transaction information. You can access the dataset here:
- Dataset overview: Dubai Pulse - DLD Transactions
- Direct download link: transactions.csv
Best Practices
- Be consistent: Use the same command structure and naming conventions throughout your exploration.
- Document your process: Keep track of the commands you run and their purposes.
- Iterative exploration: Revisit earlier steps as you gain new insights into your dataset.
- Version control: Save different versions of your cleaned and transformed data.
- Automate when possible: Create shell scripts for repetitive tasks.
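To put the last two practices into action, you can wrap the commands from this guide in a small shell script and save dated copies of its outputs. The sketch below is only a starting point; the input file name, the area_name_en column, and the output paths are assumptions you should adapt to your own project.
#!/bin/sh
# explore.sh - a repeatable exploration run (file and column names are assumed)
set -e
INPUT='DLD_Transactions_English_500.csv'
RUN_DIR="exploration_$(date +%Y%m%d)"   # one dated folder per run, for simple versioning
mkdir -p "$RUN_DIR"
qsv count "$INPUT" > "$RUN_DIR/row_count.txt"                   # dataset size
qsv headers "$INPUT" > "$RUN_DIR/columns.txt"                   # column names
qsv stats "$INPUT" | qsv table > "$RUN_DIR/summary_stats.txt"   # summary statistics
qsv frequency -s area_name_en "$INPUT" | qsv table > "$RUN_DIR/area_frequency.txt"   # value counts for one column
echo "Exploration outputs written to $RUN_DIR"
Keeping a script like this under version control documents exactly how each set of outputs was produced.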
Data Exploration Workflow with qsv
1. Initial Data Loading and Inspection
a. Preview the data as a table:
qsv sample 10 'DLD_Transactions_English_500.csv' \
| qsv table
This command randomly selects 10 rows from the CSV file, giving you a quick overview of the data structure and content.
b. Preview the data in a flattened format:
qsv sample 5 'DLD_Transactions_English_500.csv' \
| qsv flatten
This command shows 5 randomly sampled rows in a flattened format (one field per line), which is useful for reading records when the dataset has many columns.
c. Check dimensions:
qsv count 'DLD_Transactions_English_500.csv'
qsv headers 'DLD_Transactions_English_500.csv' \
| qsv table
The first command counts the number of rows in the file. The second command displays the header row, showing all column names.
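Optionally, you can also build an index of the file at this point. On large files, an index lets commands such as count, slice, and stats run much faster; for a small sample like this one it makes little difference:
qsv index 'DLD_Transactions_English_500.csv'
This creates an index file next to the CSV, which subsequent qsv commands pick up automatically.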
d. View the first few rows:
qsv slice --len 5 'DLD_Transactions_English_500.csv' \
| qsv table
This displays the first 5 rows of the dataset.
e. View the last few rows:
qsv slice --start -5 'DLD_Transactions_English_500.csv' \
| qsv table
This displays the last 5 rows of the dataset (a negative --start counts back from the end of the file).
2. Understand Data Structure
a. Examine data types and summarize the data:
qsv stats 'DLD_Transactions_English_500.csv' \
| qsv table
This command provides summary statistics for each column and displays them in a table.
b. Identify missing values:
qsv stats 'DLD_Transactions_English_500.csv' \
| qsv select field,nullcount \
| qsv table
This command chain shows the number of null values in each column in a tabular format.
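When a dataset has many columns, it helps to sort this output so the columns with the most missing values come first. One way to do that, assuming the stats output keeps the field and nullcount column names used above:
qsv stats 'DLD_Transactions_English_500.csv' \
| qsv select field,nullcount \
| qsv sort -s nullcount -N -R \
| qsv table
The -N flag sorts numerically and -R reverses the order, so the columns with the highest null counts appear at the top.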
c. Check for duplicate rows:
qsv dedup 'DLD_Transactions_English_500.csv' \
| qsv count
This deduplicates the rows and counts the result; if the count is lower than the row count from step 1c, the dataset contains duplicate rows.
d. Examine a specific column's data:
qsv select area_name_en 'DLD_Transactions_English_500.csv' \
| qsv table
This command selects and displays only the 'area_name_en' column.
e. Get unique values in a column:
qsv select area_name_en 'DLD_Transactions_English_500.csv' \
| qsv dedup \
| qsv table
This selects the 'area_name_en' column and removes duplicate rows, leaving only the unique values in that column.
f. Get the top N values in a column:
qsv frequency -s area_name_en --limit 5 'DLD_Transactions_English_500.csv' \
| qsv table
This displays the 5 most frequent values in the 'area_name_en' column, along with their counts.
3. Data Cleaning
a. Handle missing values (example: remove rows with missing values):
qsv search -s transaction_id '[^[:space:]]' 'DLD_Transactions_English_500.csv' > 'cleaned_transactions.csv'
qsv sample 10 'cleaned_transactions.csv' \
| qsv table
This removes rows with missing transaction_ids and displays a sample of the result.
b. Standardize data (example: convert dates to a consistent format):
qsv select transaction_id,instance_date 'cleaned_transactions.csv' \
| qsv datefmt --formatstr '%Y-%m-%d' instance_date \
| qsv table
This formats the instance_date column to "YYYY-MM-DD" and displays the result.
c. Remove duplicate rows:
qsv dedup 'cleaned_transactions.csv' > 'unique_transactions.csv'
qsv count 'unique_transactions.csv'
This removes duplicate rows and shows the count of unique transactions.
d. Fill missing values with a default:
qsv fill --default Unknown property_type_en 'cleaned_transactions.csv' \
| qsv table
This fills any missing values in the 'property_type_en' column with 'Unknown'.
e. Replace specific values:
qsv replace -s area_name_en 'Dubai' 'Dubai City' 'cleaned_transactions.csv' \
| qsv table
This replaces the value "Dubai" with "Dubai City" in the 'area_name_en' column (qsv replace matches a regex pattern, so it also rewrites "Dubai" wherever it appears inside longer values).
f. Remove rows with specific values:
qsv search -v -s area_name_en 'Dubai City' 'cleaned_transactions.csv' > 'filtered_transactions.csv'
qsv count 'filtered_transactions.csv'
This removes rows where the 'area_name_en' column contains "Dubai City" and shows the count of the resulting dataset.
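Once you are satisfied with each cleaning step individually, the same operations can be chained into a single pipeline. A rough sketch that reuses the commands and column names from steps a, c, and d above:
qsv search -s transaction_id '[^[:space:]]' 'DLD_Transactions_English_500.csv' \
| qsv dedup \
| qsv fill --default Unknown property_type_en \
> 'cleaned_transactions.csv'
This drops rows without a transaction_id, removes duplicates, and fills missing property types in one pass.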
4. Exploratory Data Analysis (EDA)
a. Examine unique values in a column:
qsv frequency -s area_name_en 'cleaned_transactions.csv' \
| qsv table
This counts how often each value appears in the 'area_name_en' column and displays the result in a table.
b. Basic statistical analysis:
qsv stats -s meter_sale_price 'cleaned_transactions.csv' \
| qsv table
This provides summary statistics for the meter_sale_price column in a tabular format.
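If you need more detail on a numeric column, qsv stats can compute additional measures (such as cardinality, median, quartiles, and mode) when you pass the --everything flag; this is slower but gives a fuller picture:
qsv stats --everything -s meter_sale_price 'cleaned_transactions.csv' \
| qsv table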
5. Data Transformation
a. Feature engineering (extract year from date):
qsv select transaction_id,instance_date 'cleaned_transactions.csv' \
| qsv datefmt --formatstr '%Y' instance_date \
| qsv table
This extracts just the year from the instance_date column and displays the result.
b. Data aggregation (example: count transactions by year):
qsv select transaction_id,instance_date 'cleaned_transactions.csv' \
| qsv datefmt --formatstr '%Y' instance_date \
| qsv frequency -s instance_date \
| qsv table
This counts the number of transactions per year and displays the result in a table.
6. Documentation and Reporting
Generate a summary report:
qsv stats 'cleaned_transactions.csv' \
| qsv table > 'summary_report.txt'
This generates summary statistics for all columns and saves them to a text file in a tabular format.
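You can extend this into a small reporting script that also writes a frequency report for each categorical column you care about. A sketch, assuming the column names used earlier in this guide:
#!/bin/sh
# report.sh - write a stats report plus per-column frequency reports (column names are assumed)
set -e
INPUT='cleaned_transactions.csv'
qsv stats "$INPUT" | qsv table > 'summary_report.txt'
for col in area_name_en property_type_en; do
  qsv frequency -s "$col" "$INPUT" | qsv table > "frequency_${col}.txt"
done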
7. Prepare Data for Modeling
Save processed data:
qsv select transaction_id,area_name_en,meter_sale_price,instance_date 'cleaned_transactions.csv' > 'prepared_data.csv'
qsv sample 10 'prepared_data.csv' \
| qsv table
This selects specific columns for modeling, saves them to 'prepared_data.csv', and displays a sample of the result.
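Before handing the prepared file to a modeling pipeline, you may also want to infer a schema for it and validate the file against that schema. The commands below are a sketch: qsv schema writes its inferred JSON Schema to a file named after the input (assumed here to be 'prepared_data.csv.schema.json'), and qsv validate then checks the data against it.
qsv schema 'prepared_data.csv'
qsv validate 'prepared_data.csv' 'prepared_data.csv.schema.json'
This gives downstream consumers of the data a machine-readable description of what to expect.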
This workflow demonstrates how to use qsv commands in your terminal to explore and prepare your data. Remember that data exploration is often an iterative process, and you may need to revisit earlier steps as you gain new insights into your dataset.
Note: Some advanced operations like visualization and hypothesis testing are not directly available in qsv. For these tasks, you might need to use the output from qsv with other tools or programming languages.