Getting Started with Your Project Data

A guide to exploring your HIC data extract using SQL Server, Python and R.

Overview

Every project is unique and the datasets provided are specific to its needs. This page is an introduction to a few of the most common datasets and a gentle initiation to SQL for querying and linking them together.

When your data extract is delivered, you'll find your project tables in a database named Project_XXXXX (where XXXXX is your project number).

Unless otherwise specified, the unique patient identifier across all datasets is PROCHI. This variable is unique to your project. Every PROCHI represents a single individual in your cohort, and it is the key you use to link tables together.

Lookup tables

Alongside your clinical data you'll find lookup tables that help you decode coded values. These follow a naming convention:

Working database: You also have a Project_XXXXX_Working database. Use this to store any tables you create or modify. Never write back to the main Project_XXXXX database.

Connecting

For instructions on how to connect to your project database from SSMS, Python, R and other tools, see How do I connect to my study's SQL database? on the HIC Knowledge Base. The examples below assume you already have an open connection.

Demography

The Demography dataset contains one row per individual in your project cohort. It holds non-medical information about each patient, including three core columns:

Depending on your project, you may also find postcode district, SIMD deprivation metrics (by DataZone), anonymised GP practice, and care home residency flags. Your table will typically be at Project_XXXXX.dbo.Demography_Current.

Exploring Your Cohort

Take a random sample

A good first step is to eyeball a small slice of your data. This returns 100 randomly selected rows.

SELECT TOP 100 *
FROM Project_12345..Demography_Current
ORDER BY NEWID()
with engine.connect() as conn:
    df = pd.read_sql("""
        SELECT TOP 100 *
        FROM Project_12345..Demography_Current
        ORDER BY NEWID()
    """, conn)

df.head()
df <- dbGetQuery(conn, "
    SELECT TOP 100 *
    FROM Project_12345..Demography_Current
    ORDER BY NEWID()
")

head(df)

NEWID() assigns a random GUID to each row. Ordering by it shuffles the results, and TOP 100 takes the first 100.

Patient count by sex

Count the number of distinct patients grouped by sex.

SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount
FROM Project_12345..Demography_Current
GROUP BY sex
with engine.connect() as conn:
    df = pd.read_sql("""
        SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount
        FROM Project_12345..Demography_Current
        GROUP BY sex
    """, conn)

df
df <- dbGetQuery(conn, "
    SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount
    FROM Project_12345..Demography_Current
    GROUP BY sex
")

df

Patient count by sex with percentages

Add a percentage column using a variable to hold the total population. This avoids the cross-join approach, which is less readable and slower on large tables.

DECLARE @TotalPop AS INT
SELECT @TotalPop = COUNT(DISTINCT PROCHI)
FROM Project_12345..Demography_Current

SELECT sex,
       COUNT(DISTINCT PROCHI) AS PatientCount,
       COUNT(DISTINCT PROCHI) * 100.0 / @TotalPop AS Pct
FROM Project_12345..Demography_Current
GROUP BY sex
with engine.connect() as conn:
    df = pd.read_sql("""
        SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount
        FROM Project_12345..Demography_Current
        GROUP BY sex
    """, conn)

df["Pct"] = df["PatientCount"] / df["PatientCount"].sum() * 100
df
df <- dbGetQuery(conn, "
    SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount
    FROM Project_12345..Demography_Current
    GROUP BY sex
")

df$Pct <- df$PatientCount / sum(df$PatientCount) * 100
df

Note: use 100.0 (not 100) in SQL to avoid integer division truncating the result.

Recalculate age into the Working database

The Calculated_Age column reflects the patient's age at the time of data extraction. If you need a current age (or want to add any derived column), write the result into your Project_XXXXX_Working database.

-- Create a copy with a recalculated age column
SELECT *,
       DATEDIFF(year, anon_date_of_birth, GETDATE()) AS Recalculated_Age
INTO Project_12345_Working..Demog
FROM Project_12345..Demography_Current

-- To refresh the age column later:
UPDATE Project_12345_Working..Demog
SET Recalculated_Age = DATEDIFF(year, anon_date_of_birth, GETDATE())
with engine.connect() as conn:
    df = pd.read_sql("SELECT * FROM Project_12345..Demography_Current", conn)

# anon_date_of_birth is already a datetime column
df["Recalculated_Age"] = (
    pd.Timestamp.now().year - df["anon_date_of_birth"].dt.year
)

df.head()
df <- dbGetQuery(conn, "SELECT * FROM Project_12345..Demography_Current")

df$Recalculated_Age <- as.integer(
  format(Sys.Date(), "%Y") ) - as.integer(format(df$anon_date_of_birth, "%Y")
)

head(df)
Note on age calculation: DATEDIFF(year, ...) subtracts the year numbers only. It does not check whether the birthday has occurred yet this year, so it can be off by one. Given that anon_date_of_birth is already anonymised to a quarterly boundary (±3 months), this approximation is acceptable for coarse age bands but worth being aware of.

Distribution by sex and age band

Group patients into age bands and view the distribution as a cross-tabulation. This uses a temporary table (#temp) to store the intermediate result.

-- Step 1: Build the grouped counts into a temp table
SELECT sex,
       CASE
           WHEN Calculated_Age < 30                      THEN '<30'
           WHEN Calculated_Age BETWEEN 30 AND 39       THEN '30-39'
           WHEN Calculated_Age BETWEEN 40 AND 49       THEN '40-49'
           ELSE '50+'
       END AS AgeGroup,
       COUNT(DISTINCT PROCHI) AS PatientCount
INTO #temp
FROM Project_12345..Demography_Current
GROUP BY sex,
       CASE
           WHEN Calculated_Age < 30                      THEN '<30'
           WHEN Calculated_Age BETWEEN 30 AND 39       THEN '30-39'
           WHEN Calculated_Age BETWEEN 40 AND 49       THEN '40-49'
           ELSE '50+'
       END

-- Step 2: Pivot to wide format
SELECT AgeGroup,
       SUM(IIF(sex = 'M', PatientCount, 0)) AS Male,
       SUM(IIF(sex = 'F', PatientCount, 0)) AS Female
FROM #temp
GROUP BY AgeGroup

-- Clean up
DROP TABLE #temp
with engine.connect() as conn:
    df = pd.read_sql("""
        SELECT sex, Calculated_Age, PROCHI
        FROM Project_12345..Demography_Current
    """, conn)

bins = [0, 30, 40, 50, 200]
labels = ["<30", "30-39", "40-49", "50+"]
df["AgeGroup"] = pd.cut(df["Calculated_Age"], bins=bins, labels=labels, right=False)

pivot = df.groupby(["AgeGroup", "sex"])["PROCHI"].nunique().unstack(fill_value=0)
pivot.rename(columns={"M": "Male", "F": "Female"})
df <- dbGetQuery(conn, "
    SELECT sex, Calculated_Age, PROCHI
    FROM Project_12345..Demography_Current
")

df$AgeGroup <- cut(df$Calculated_Age,
                   breaks = c(0, 30, 40, 50, Inf),
                   labels = c("<30", "30-39", "40-49", "50+"),
                   right = FALSE)

table(df$AgeGroup, df$sex)
Tip: Use local temp tables (#temp) rather than global ones (##temp). Global temp tables are visible to all sessions on the server, which can cause conflicts in a shared environment.

Prescribing

Prescribing data covers primary care dispensed prescriptions and is structured across two tables:

Random sample joining both tables

To get a feel for the prescribing data, join Scripts to the Item lookup and sample 100 rows.

SELECT TOP 100
       s.PROCHI,
       s.PRESCRIBED_DATE,
       s.SCAN_REF_NO,
       s.line_no,
       i.Approved_Name,
       i.formatted_BNF_Code,
       i.Strength,
       i.Formulation_code
FROM Project_12345..Prescribing_Scripts s
INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i
    ON s.hic_drugID = i.hic_drugID
ORDER BY NEWID()
with engine.connect() as conn:
    df = pd.read_sql("""
        SELECT TOP 100
               s.PROCHI,
               s.PRESCRIBED_DATE,
               s.SCAN_REF_NO,
               s.line_no,
               i.Approved_Name,
               i.formatted_BNF_Code,
               i.Strength,
               i.Formulation_code
        FROM Project_12345..Prescribing_Scripts s
        INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i
            ON s.hic_drugID = i.hic_drugID
        ORDER BY NEWID()
    """, conn)

df.head()
df <- dbGetQuery(conn, "
    SELECT TOP 100
           s.PROCHI,
           s.PRESCRIBED_DATE,
           s.SCAN_REF_NO,
           s.line_no,
           i.Approved_Name,
           i.formatted_BNF_Code,
           i.Strength,
           i.Formulation_code
    FROM Project_12345..Prescribing_Scripts s
    INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i
        ON s.hic_drugID = i.hic_drugID
    ORDER BY NEWID()
")

head(df)

Prescription count by drug and age band

Link prescribing to demography to see which drugs are prescribed most frequently, broken down by patient age band. This counts prescription records, so a patient prescribed the same drug multiple times is counted each time.

SELECT
    CASE
        WHEN d.Calculated_Age < 20                  THEN '<20'
        WHEN d.Calculated_Age BETWEEN 20 AND 29   THEN '20-29'
        WHEN d.Calculated_Age BETWEEN 30 AND 39   THEN '30-39'
        WHEN d.Calculated_Age BETWEEN 40 AND 49   THEN '40-49'
        WHEN d.Calculated_Age BETWEEN 50 AND 59   THEN '50-59'
        WHEN d.Calculated_Age BETWEEN 60 AND 69   THEN '60-69'
        WHEN d.Calculated_Age BETWEEN 70 AND 79   THEN '70-79'
        ELSE '80+'
    END AS AgeGroup,
    i.Approved_Name,
    COUNT(*) AS PrescriptionCount
FROM Project_12345..Prescribing_Scripts s
INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i
    ON s.hic_drugID = i.hic_drugID
INNER JOIN Project_12345..Demography_Current d
    ON s.PROCHI = d.PROCHI
GROUP BY
    CASE
        WHEN d.Calculated_Age < 20                  THEN '<20'
        WHEN d.Calculated_Age BETWEEN 20 AND 29   THEN '20-29'
        WHEN d.Calculated_Age BETWEEN 30 AND 39   THEN '30-39'
        WHEN d.Calculated_Age BETWEEN 40 AND 49   THEN '40-49'
        WHEN d.Calculated_Age BETWEEN 50 AND 59   THEN '50-59'
        WHEN d.Calculated_Age BETWEEN 60 AND 69   THEN '60-69'
        WHEN d.Calculated_Age BETWEEN 70 AND 79   THEN '70-79'
        ELSE '80+'
    END,
    i.Approved_Name
ORDER BY PrescriptionCount DESC
with engine.connect() as conn:
    scripts = pd.read_sql("""
        SELECT s.PROCHI, s.hic_drugID
        FROM Project_12345..Prescribing_Scripts s
    """, conn)

    items = pd.read_sql("""
        SELECT hic_drugID, Approved_Name
        FROM Project_12345..Prescribing_Scripts_Prescribing_Item
    """, conn)

    demog = pd.read_sql("""
        SELECT PROCHI, Calculated_Age
        FROM Project_12345..Demography_Current
    """, conn)

# Merge the three tables
df = scripts.merge(items, on="hic_drugID").merge(demog, on="PROCHI")

# Create age bands in 10-year intervals
bins = [0, 20, 30, 40, 50, 60, 70, 80, 200]
labels = ["<20", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"]
df["AgeGroup"] = pd.cut(df["Calculated_Age"], bins=bins, labels=labels, right=False)

# Count prescription records per age group and drug
result = (df.groupby(["AgeGroup", "Approved_Name"])
            .size()
            .reset_index(name="PrescriptionCount")
            .sort_values("PrescriptionCount", ascending=False))
result
# Load all three tables
scripts <- dbGetQuery(conn, "
    SELECT PROCHI, hic_drugID
    FROM Project_12345..Prescribing_Scripts
")

items <- dbGetQuery(conn, "
    SELECT hic_drugID, Approved_Name
    FROM Project_12345..Prescribing_Scripts_Prescribing_Item
")

demog <- dbGetQuery(conn, "
    SELECT PROCHI, Calculated_Age
    FROM Project_12345..Demography_Current
")

# Merge and create age bands
df <- merge(merge(scripts, items, by = "hic_drugID"), demog, by = "PROCHI")

df$AgeGroup <- cut(df$Calculated_Age,
                   breaks = c(0, 20, 30, 40, 50, 60, 70, 80, Inf),
                   labels = c("<20", "20-29", "30-39", "40-49",
                              "50-59", "60-69", "70-79", "80+"),
                   right = FALSE)

# Count prescription records
result <- aggregate(hic_drugID ~ AgeGroup + Approved_Name, data = df, FUN = length)
names(result)[3] <- "PrescriptionCount"
result <- result[order(-result$PrescriptionCount), ]
head(result, 20)

This counts prescription records. A patient receiving the same drug 12 times contributes 12 to the count.

Patient count by drug and age band

Same structure, but counting distinct patients per drug and age band. This tells you how many people were prescribed each drug, regardless of how many times.

SELECT
    CASE
        WHEN d.Calculated_Age < 20                  THEN '<20'
        WHEN d.Calculated_Age BETWEEN 20 AND 29   THEN '20-29'
        WHEN d.Calculated_Age BETWEEN 30 AND 39   THEN '30-39'
        WHEN d.Calculated_Age BETWEEN 40 AND 49   THEN '40-49'
        WHEN d.Calculated_Age BETWEEN 50 AND 59   THEN '50-59'
        WHEN d.Calculated_Age BETWEEN 60 AND 69   THEN '60-69'
        WHEN d.Calculated_Age BETWEEN 70 AND 79   THEN '70-79'
        ELSE '80+'
    END AS AgeGroup,
    i.Approved_Name,
    COUNT(DISTINCT s.PROCHI) AS PatientCount
FROM Project_12345..Prescribing_Scripts s
INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i
    ON s.hic_drugID = i.hic_drugID
INNER JOIN Project_12345..Demography_Current d
    ON s.PROCHI = d.PROCHI
GROUP BY
    CASE
        WHEN d.Calculated_Age < 20                  THEN '<20'
        WHEN d.Calculated_Age BETWEEN 20 AND 29   THEN '20-29'
        WHEN d.Calculated_Age BETWEEN 30 AND 39   THEN '30-39'
        WHEN d.Calculated_Age BETWEEN 40 AND 49   THEN '40-49'
        WHEN d.Calculated_Age BETWEEN 50 AND 59   THEN '50-59'
        WHEN d.Calculated_Age BETWEEN 60 AND 69   THEN '60-69'
        WHEN d.Calculated_Age BETWEEN 70 AND 79   THEN '70-79'
        ELSE '80+'
    END,
    i.Approved_Name
ORDER BY PatientCount DESC
# Reusing df from the previous example
result = (df.groupby(["AgeGroup", "Approved_Name"])["PROCHI"]
            .nunique()
            .reset_index(name="PatientCount")
            .sort_values("PatientCount", ascending=False))
result
# Reusing df from the previous example
result <- aggregate(
    PROCHI ~ AgeGroup + Approved_Name,
    data = df,
    FUN = function(x) length(unique(x))
)
names(result)[3] <- "PatientCount"
result <- result[order(-result$PatientCount), ]
head(result, 20)

The only change from the previous query is COUNT(DISTINCT PROCHI) instead of COUNT(*). This gives the number of unique individuals prescribed each drug per age band.

Labs

There are five laboratory datasets: Haematology, Immunology, Biochemistry, Virology and Microbiology. For the first four, the raw data is sent to HIC in three parts (Header, Results and Sample Details). For your convenience, these tables are joined together and provided to you as a single "Restructured" table per lab type.

Naming conventions: Although the restructured tables are produced in the same way, their names are not entirely consistent across lab types: HaematologyRestructured, ImmunologyRestructured, VirologyRestructured, and Labs_Biochem. The structure and columns are the same in each case.

Microbiology is provided differently, as three separate tables: MicroBio_Isolat, MicroBio_Lab and MicroBio_Tests. A detailed description of the Microbiology schema is out of scope for this page.

The examples below use HaematologyRestructured, but the same queries apply to any of the restructured lab tables.

Random sample of lab results with demographics

Join the lab table to Demography to see patient context alongside results.

SELECT TOP 100
       h.PROCHI,
       d.sex,
       d.Calculated_Age,
       h.DateTimeSampled,
       h.LocalClinicalCodeDescription,
       h.QuantityValue,
       h.QuantityUnit,
       h.RangeLowValue,
       h.RangeHighValue,
       h.Interpretation
FROM Project_12345..HaematologyRestructured h
INNER JOIN Project_12345..Demography_Current d
    ON h.PROCHI = d.PROCHI
ORDER BY NEWID()
with engine.connect() as conn:
    df = pd.read_sql("""
        SELECT TOP 100
               h.PROCHI,
               d.sex,
               d.Calculated_Age,
               h.DateTimeSampled,
               h.LocalClinicalCodeDescription,
               h.QuantityValue,
               h.QuantityUnit,
               h.RangeLowValue,
               h.RangeHighValue,
               h.Interpretation
        FROM Project_12345..HaematologyRestructured h
        INNER JOIN Project_12345..Demography_Current d
            ON h.PROCHI = d.PROCHI
        ORDER BY NEWID()
    """, conn)

df.head()
df <- dbGetQuery(conn, "
    SELECT TOP 100
           h.PROCHI,
           d.sex,
           d.Calculated_Age,
           h.DateTimeSampled,
           h.LocalClinicalCodeDescription,
           h.QuantityValue,
           h.QuantityUnit,
           h.RangeLowValue,
           h.RangeHighValue,
           h.Interpretation
    FROM Project_12345..HaematologyRestructured h
    INNER JOIN Project_12345..Demography_Current d
        ON h.PROCHI = d.PROCHI
    ORDER BY NEWID()
")

head(df)

Patients with and without lab results

Not every patient in your cohort will have lab results. Use a LEFT JOIN to identify which patients have haematology records and which do not.

-- Patients WITH at least one haematology result
SELECT COUNT(DISTINCT d.PROCHI) AS WithLabs
FROM Project_12345..Demography_Current d
INNER JOIN Project_12345..HaematologyRestructured h
    ON d.PROCHI = h.PROCHI

-- Patients WITHOUT any haematology result
SELECT COUNT(DISTINCT d.PROCHI) AS WithoutLabs
FROM Project_12345..Demography_Current d
LEFT JOIN Project_12345..HaematologyRestructured h
    ON d.PROCHI = h.PROCHI
WHERE h.PROCHI IS NULL

-- Combined summary in a single query
SELECT
    COUNT(DISTINCT d.PROCHI) AS TotalCohort,
    COUNT(DISTINCT h.PROCHI) AS WithLabs,
    COUNT(DISTINCT d.PROCHI) - COUNT(DISTINCT h.PROCHI) AS WithoutLabs
FROM Project_12345..Demography_Current d
LEFT JOIN Project_12345..HaematologyRestructured h
    ON d.PROCHI = h.PROCHI
with engine.connect() as conn:
    demog = pd.read_sql("""
        SELECT DISTINCT PROCHI
        FROM Project_12345..Demography_Current
    """, conn)

    lab_prochis = pd.read_sql("""
        SELECT DISTINCT PROCHI
        FROM Project_12345..HaematologyRestructured
    """, conn)

# Patients with and without lab results
has_labs = demog["PROCHI"].isin(lab_prochis["PROCHI"])

print(f"Total cohort:  {len(demog)}")
print(f"With labs:     {has_labs.sum()}")
print(f"Without labs:  {(~has_labs).sum()}")
demog <- dbGetQuery(conn, "
    SELECT DISTINCT PROCHI
    FROM Project_12345..Demography_Current
")

lab_prochis <- dbGetQuery(conn, "
    SELECT DISTINCT PROCHI
    FROM Project_12345..HaematologyRestructured
")

has_labs <- demog$PROCHI %in% lab_prochis$PROCHI

cat("Total cohort:", nrow(demog), "\n")
cat("With labs:   ", sum(has_labs), "\n")
cat("Without labs:", sum(!has_labs), "\n")

The LEFT JOIN ... WHERE h.PROCHI IS NULL pattern is a standard way to find rows in one table that have no match in another. The combined query at the bottom gives you all three numbers in one go.

Scottish Morbidity Records (SMR)

SMR is a comprehensive, national series of patient-level datasets that track patient activity across NHS Scotland hospitals. Collected since the 1960s, they cover inpatient, day case and outpatient care, providing detailed, linkable information on diagnoses, procedures and patient demographics, with over 98% completeness for key metrics such as births. They are widely used for health planning, research and monitoring.

The main SMR datasets are:

SMR02, SMR04 and SMR06 are broadly similar in structure to SMR01, but may contain additional columns specific to their domain and may cover locations other than acute hospitals. Your project database will contain whichever SMR datasets were approved for your study.

The examples below use SMR01_Admissions. Key columns include MAIN_CONDITION (ICD-10 primary diagnosis), ADMISSION_DATE, DISCHARGE_DATE, LENGTH_OF_STAY, and MAIN_OPERATION (OPCS-4 procedure code).

Random sample with demographics

Join SMR01 to Demography for a quick look at admissions alongside patient context.

SELECT TOP 100
       s.PROCHI,
       d.sex,
       d.Calculated_Age,
       s.ADMISSION_DATE,
       s.DISCHARGE_DATE,
       s.LENGTH_OF_STAY,
       s.MAIN_CONDITION,
       s.MAIN_OPERATION
FROM Project_12345..SMR01_Admissions s
INNER JOIN Project_12345..Demography_Current d
    ON s.PROCHI = d.PROCHI
ORDER BY NEWID()
with engine.connect() as conn:
    df = pd.read_sql("""
        SELECT TOP 100
               s.PROCHI,
               d.sex,
               d.Calculated_Age,
               s.ADMISSION_DATE,
               s.DISCHARGE_DATE,
               s.LENGTH_OF_STAY,
               s.MAIN_CONDITION,
               s.MAIN_OPERATION
        FROM Project_12345..SMR01_Admissions s
        INNER JOIN Project_12345..Demography_Current d
            ON s.PROCHI = d.PROCHI
        ORDER BY NEWID()
    """, conn)

df.head()
df <- dbGetQuery(conn, "
    SELECT TOP 100
           s.PROCHI,
           d.sex,
           d.Calculated_Age,
           s.ADMISSION_DATE,
           s.DISCHARGE_DATE,
           s.LENGTH_OF_STAY,
           s.MAIN_CONDITION,
           s.MAIN_OPERATION
    FROM Project_12345..SMR01_Admissions s
    INNER JOIN Project_12345..Demography_Current d
        ON s.PROCHI = d.PROCHI
    ORDER BY NEWID()
")

head(df)

Most frequent conditions recorded

Count admissions by MAIN_CONDITION (ICD-10 code) to see which diagnoses are most common in your cohort.

SELECT TOP 20
       MAIN_CONDITION,
       COUNT(*) AS AdmissionCount
FROM Project_12345..SMR01_Admissions
GROUP BY MAIN_CONDITION
ORDER BY AdmissionCount DESC
with engine.connect() as conn:
    df = pd.read_sql("""
        SELECT TOP 20
               MAIN_CONDITION,
               COUNT(*) AS AdmissionCount
        FROM Project_12345..SMR01_Admissions
        GROUP BY MAIN_CONDITION
        ORDER BY AdmissionCount DESC
    """, conn)

df
df <- dbGetQuery(conn, "
    SELECT TOP 20
           MAIN_CONDITION,
           COUNT(*) AS AdmissionCount
    FROM Project_12345..SMR01_Admissions
    GROUP BY MAIN_CONDITION
    ORDER BY AdmissionCount DESC
")

df

The MAIN_CONDITION column contains ICD-10 codes. You can cross-reference these with the G_World_Health_Organisation_ICD10_Codes lookup table if you need human-readable descriptions.

Top patients by total hospitalisations

Identify the patients with the most admission episodes.

SELECT TOP 20
       s.PROCHI,
       d.sex,
       d.Calculated_Age,
       COUNT(*) AS TotalAdmissions,
       SUM(s.LENGTH_OF_STAY) AS TotalDays
FROM Project_12345..SMR01_Admissions s
INNER JOIN Project_12345..Demography_Current d
    ON s.PROCHI = d.PROCHI
GROUP BY s.PROCHI, d.sex, d.Calculated_Age
ORDER BY TotalAdmissions DESC
with engine.connect() as conn:
    df = pd.read_sql("""
        SELECT TOP 20
               s.PROCHI,
               d.sex,
               d.Calculated_Age,
               COUNT(*) AS TotalAdmissions,
               SUM(s.LENGTH_OF_STAY) AS TotalDays
        FROM Project_12345..SMR01_Admissions s
        INNER JOIN Project_12345..Demography_Current d
            ON s.PROCHI = d.PROCHI
        GROUP BY s.PROCHI, d.sex, d.Calculated_Age
        ORDER BY TotalAdmissions DESC
    """, conn)

df
df <- dbGetQuery(conn, "
    SELECT TOP 20
           s.PROCHI,
           d.sex,
           d.Calculated_Age,
           COUNT(*) AS TotalAdmissions,
           SUM(s.LENGTH_OF_STAY) AS TotalDays
    FROM Project_12345..SMR01_Admissions s
    INNER JOIN Project_12345..Demography_Current d
        ON s.PROCHI = d.PROCHI
    GROUP BY s.PROCHI, d.sex, d.Calculated_Age
    ORDER BY TotalAdmissions DESC
")

df

This also sums LENGTH_OF_STAY to show total days spent in hospital alongside the admission count.

Checking Your Environment

If you need to check the software versions available in your workspace, the following snippets will display the relevant information. This page was written and tested against the versions shown in the output below.

SELECT @@VERSION
-- Microsoft SQL Server 2022 (RTM-CU24) (KB5080999) -- 16.0.4245.2 (X64) Feb 25 2026 15:01:38 -- Copyright (C) 2022 Microsoft Corporation -- Standard Edition (64-bit) on Windows Server 2019 -- Standard 10.0 <X64> (Build 17763: ) (Hypervisor)
import sys
import IPython
import matplotlib
import pandas
import pyodbc
import sqlalchemy

print(f"Python     : {sys.version}")
print(f"IPython    : {IPython.__version__}")
print(f"matplotlib : {matplotlib.__version__}")
print(f"pandas     : {pandas.__version__}")
print(f"pyodbc     : {pyodbc.version}")
print(f"sqlalchemy : {sqlalchemy.__version__}")

# Python version       : 3.14.3
# IPython version      : 9.12.0
# matplotlib           : 3.10.8
# pandas               : 3.0.2
# pyodbc               : 5.3.0
# sqlalchemy           : 2.0.48
sessionInfo()

# R version 4.4.3 (2025-02-28 ucrt)
# Platform: x86_64-w64-mingw32/x64
# Running under: Windows 11 x64 (build 26100)
#
# attached base packages:
# stats  graphics  grDevices  utils  datasets  methods  base
#
# other attached packages:
# glue_1.8.0  odbc_1.6.4.1  DBI_1.3.0