Semi-Supervised Learning on Criminal Records

Categorize Crimes by Severity

Abstract

With the rise of digital records, law enforcement agencies hold vast amounts of criminal data but often struggle to categorize it for actionable insights. In this project, we scraped criminal records from a U.S. state and used semi-supervised learning to classify crimes as either severe crimes (e.g., assaults, thefts) or minor offences (e.g., traffic tickets). The aim was to help law enforcement prioritize cases and optimize resource allocation. This case study outlines the data collection process, model development, the challenges we faced, and how we resolved them.

Introduction

Law enforcement agencies are burdened with ever-growing criminal records, a large portion of which is unstructured or unlabeled. In this project, we leveraged semi-supervised learning to label and classify crimes. Semi-supervised learning, which combines a small labelled dataset with a larger unlabeled dataset, was chosen due to the limited availability of pre-labeled data.

Our objectives were:

  1. Data Categorization: Separate severe crimes from minor infractions.
  2. Efficiency: Reduce manual data labelling efforts.
  3. Insights: Provide actionable intelligence for law enforcement resource optimization.

Data Collection and Preprocessing

1. Data Scraping:
We gathered criminal records from publicly accessible state databases and online repositories. The data included:
  • Personal Details: Age, gender, address.
  • Crime Details: Incident description, date, location, and legal code references.

2. Data Characteristics:
  • Labelled Data: 10% of the dataset was pre-labeled based on state law categorizations.
  • Unlabeled Data: The remaining 90% comprised descriptions of incidents without severity tags.

3. Preprocessing Steps:
  • Data Cleaning: Removed duplicates, resolved formatting inconsistencies, and anonymized personal details to comply with privacy laws.
  • Text Tokenization: Tokenized crime descriptions and normalized them with standard NLP steps (e.g., lemmatization and stopword removal).
  • Feature Engineering: Extracted features such as the frequency of specific keywords (e.g., "homicide," "ticket"), temporal patterns, and geographical data.
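
The text-related steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the stopword list and the keyword list are hypothetical, lemmatization is omitted to keep the example free of third-party dependencies, and the temporal and geographical features are not covered.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "at", "and", "to", "was", "with"}

# Hypothetical severity-related keywords, following the examples given in the text.
KEYWORDS = ("homicide", "assault", "theft", "ticket")


def tokenize(description):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return [t for t in tokens if t not in STOPWORDS]


def keyword_features(description):
    """Return how often each tracked keyword occurs in one crime description."""
    counts = Counter(tokenize(description))
    return {kw: counts[kw] for kw in KEYWORDS}


if __name__ == "__main__":
    print(tokenize("Ticket issued for speeding on the highway"))
    print(keyword_features("Assault reported; suspect fled after the assault"))
```

Each description becomes a small dictionary of keyword counts that can be joined with the other engineered features before training.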

Model Development

1. Semi-Supervised Learning Approach: We used a combination of labelled and unlabeled data:

  • Training Set: The labelled data (10%) served as the initial training set.
  • Unlabeled Data Integration: We applied a pseudo-labeling technique to iteratively assign labels to the unlabeled data and refine the model.

2. Algorithm Selection: We employed a self-training framework with the following components:

  • Base Model: Random Forest for initial classification due to its robustness with categorical data.
  • Semi-Supervised Enhancements: Retrained the model on the pseudo-labeled records in each iteration to improve performance.
  • Embedding Techniques: Used BERT embeddings for textual data representation to enhance the model's understanding of crime descriptions.
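
To make the self-training loop concrete, here is a minimal sketch under stated assumptions. The `CentroidClassifier` is a toy stand-in for the Random Forest base model so that the example runs without third-party libraries, the short numeric vectors stand in for BERT embeddings, and the confidence `threshold` and `max_rounds` values are hypothetical. It is not the project's implementation.

```python
import math


class CentroidClassifier:
    """Toy stand-in for the base model: stores one mean vector per class."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_with_confidence(self, x):
        """Return (label, confidence); confidence is a softmax over negative distances."""
        weights = {lab: math.exp(-math.dist(x, c)) for lab, c in self.centroids.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


def self_train(model, X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    """Iteratively move confidently predicted unlabeled rows into the training set."""
    X_train, y_train = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(max_rounds):
        model.fit(X_train, y_train)
        remaining = []
        for x in pool:
            label, confidence = model.predict_with_confidence(x)
            if confidence >= threshold:
                X_train.append(x)       # accept the pseudo-label
                y_train.append(label)
            else:
                remaining.append(x)     # keep for a later round
        if len(remaining) == len(pool):  # nothing was confident enough, so stop
            break
        pool = remaining
    model.fit(X_train, y_train)
    return model, y_train, pool


if __name__ == "__main__":
    X_lab, y_lab = [[0.0, 0.0], [10.0, 10.0]], ["minor", "severe"]
    X_unlab = [[1.0, 0.0], [9.0, 10.0], [5.0, 5.0]]
    _, labels, leftover = self_train(CentroidClassifier(), X_lab, y_lab, X_unlab)
    print(labels, leftover)
```

Rows that never reach the confidence threshold stay in the returned pool, which makes them natural candidates for manual review.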

3. Metrics: Key metrics included accuracy, F1-score, and label consistency (the agreement between pseudo-labels and manual labels) to measure classification reliability.
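
As a small illustration of how these metrics are computed, the sketch below derives accuracy, precision, recall, and F1 from two label lists. The function names are ours, and treating label consistency as the plain agreement rate between pseudo-labels and manual labels is an assumption about how it was measured.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the two label lists agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Assumption: label consistency is the agreement rate between pseudo-labels and
# manual labels, which is the same computation as accuracy.
label_consistency = accuracy
```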


Challenges and Solutions

1. Challenge: Limited Labelled Data

The primary challenge was the scarcity of labelled data for training.
Solution: We used active learning to select the most uncertain samples in the unlabeled dataset for manual labelling, maximizing model improvement with minimal manual effort.
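
One common way to pick "the most uncertain samples" is uncertainty sampling on the predicted class probability. The sketch below assumes a binary severe/minor setup in which the model outputs P(severe) for each record; the function name and the `budget` parameter are hypothetical, and the text does not say which selection criterion was actually used.

```python
def most_uncertain(probabilities, budget):
    """Return indices of the `budget` samples whose P(severe) is closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]


if __name__ == "__main__":
    p_severe = [0.97, 0.52, 0.10, 0.45, 0.80]
    # Records 1 and 3 are the ones the model is least sure about.
    print(most_uncertain(p_severe, budget=2))
```

The selected records go to a human annotator, and the newly labelled rows are added to the training set before the next round.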

2. Challenge: Ambiguity in Crime Descriptions

Some records had vague or overlapping descriptions.
Solution: We used contextual embeddings (BERT) to capture nuanced meanings in the text and trained the model to associate specific terms with severity levels.

3. Challenge: Data Imbalance

Severe crimes were underrepresented compared to minor infractions.
Solution: We applied oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset.
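
The core idea behind SMOTE is to create synthetic minority-class rows by interpolating between a real minority row and one of its nearest minority neighbours. The following is a simplified sketch of that idea on numeric feature vectors, not the reference algorithm or any library's implementation; the function name, `k`, and `seed` are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority rows to the chosen row, excluding the row itself.
        neighbours = sorted((row for row in minority if row is not base),
                            key=lambda row: math.dist(base, row))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


if __name__ == "__main__":
    severe_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(smote_like(severe_rows, n_new=3))
```

Oversampling of this kind is normally applied to the training split only, so that synthetic rows do not leak into the evaluation data.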

4. Challenge: Privacy Concerns

Handling sensitive personal information posed ethical and legal risks.
Solution: We anonymized the data by removing identifiable details and encrypted it for secure storage.
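
Removing identifiable details can be as simple as dropping the relevant fields before a record enters the training pipeline. In the sketch below the field names are hypothetical, since the record schema is not given here, and the encryption step is left out.

```python
# Hypothetical field names; the actual record schema is not specified in the text.
IDENTIFYING_FIELDS = {"name", "address"}


def anonymize(record):
    """Return a copy of the record without directly identifying fields."""
    return {key: value for key, value in record.items()
            if key not in IDENTIFYING_FIELDS}


if __name__ == "__main__":
    raw = {"name": "J. Doe", "address": "12 Example St", "age": 34,
           "description": "Ticket issued for speeding"}
    print(anonymize(raw))
```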

Results

  1. Classification Accuracy: The semi-supervised model achieved an overall accuracy of 87% on the test set.
  2. Performance by Category: Severe crimes reached 89% precision and 85% recall; minor infractions reached 85% precision and 88% recall.
  3. Label Consistency: Over 95% of pseudo-labeled data matched manual labelling during validation.
  4. Insights Generated: Identified geographical hotspots for severe crimes. Found temporal patterns, such as a spike in traffic tickets during weekends.
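
For reference, the per-category precision and recall figures reported above can be combined into F1-scores with the standard harmonic-mean formula. The values computed here are derived from those figures and were not reported separately; they come out at roughly 0.87 for severe crimes and 0.86 for minor infractions.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


severe_f1 = f1_from(0.89, 0.85)  # severe crimes
minor_f1 = f1_from(0.85, 0.88)   # minor infractions

if __name__ == "__main__":
    print(f"severe: {severe_f1:.3f}, minor: {minor_f1:.3f}")
```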

Conclusion

Our semi-supervised learning framework successfully categorized unlabeled criminal records into severe and minor categories. The approach demonstrated that even with limited labelled data, a combination of machine learning techniques and domain knowledge can yield actionable results.
This project highlights the potential of AI/ML in enhancing law enforcement processes by automating labor-intensive tasks, enabling better resource prioritization, and improving public safety outcomes.