Creating Instance Groups with the GCP API

Tags

Google Cloud Platform

Docker

Overview

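This post covers how to create an instance group with the Google Cloud API.

On GCP, Vertex AI generally makes it easy to build machine learning pipelines. Thanks to the automation built into the cloud, the effort we have to invest drops dramatically!

That said, I find that Vertex AI has two major inconveniences.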

•

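VM instance type customization: Vertex AI does not let you choose the type (provisioning model) of each instance. SPOT machines are typically about 60% to 90% cheaper than STANDARD machines, so not being able to pick them can be a big drawback when resources are limited.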

•

Disk mounting: Vertex AI does not provide a way to mount an SSD disk on each instance. This can get in the way when preparing large image or text-corpus datasets.

Instance groups are a very handy way to make up for these shortcomings! They also offer the same nice features that Vertex AI provides.

•

Auto-healing

•

Auto-scaling

•

Load balancing

The downside, though, is that defining an instance group through the API is quite complicated. Working through the related course and the official reference took me a fair amount of effort.

Briefly, here is what we are going to do.

•

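step 1) Define the VM instance template. The template lists the resources needed to create a VM instance.

•

step 2) Define the VM instance group template. This is where we describe how to create an instance group from the VM instance template.

•

step 3) Create the VM instance group from the main code.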

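It is complicated, but once the templates are defined, creating and using instance groups afterwards is very easy. So I am writing this post to organize and share what I learned :)

As a prerequisite, you need to install the Google Cloud SDK and set up a GCP account. For that part, please see the SDK installation link in the Reference section!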

Project structure

The code is organized as follows.

.
├── configs
│   ├── __init__.py
│   └── infrastructure
│       ├── infrastructure_schemas.py
│       ├── instance_group_creator_schemas.py
│       └── instance_template_creator_schemas.py
├── experiment
│   └── simple-vm.yaml
├── instance_group_creator.py
├── instance_template_creator.py
├── launch_job_on_gcp.py
├── scripts
│   └── task_runner_startup_script.sh
└── utils.py

•

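The configs folder contains the configuration code for the VM instance and instance group templates.

•

The scripts folder contains the script that runs automatically once a VM instance has been created.

•

The experiment folder contains the yaml file you need to configure in order to create a VM instance group.

•

The three files below are the core of the project.

◦

instance_template_creator.py: covers how to create the VM instance template.

◦

instance_group_creator.py: covers how to create the instance group.

◦

launch_job_on_gcp.py: the main code, which creates the VM instances and the instance group.

The full code is attached below for reference. This post focuses on the core code :)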

making_gcp_instance_group

jihoahn9303

Defining the VM instance template

Creating a VM instance requires the following components.

•

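Boot disk: operating system, version, disk size, etc.

•

Network: network / subnetwork addresses and IP assignment settings (whether external communication is allowed, protocol), etc.

•

Machine information: machine type, GPU type and count, etc.

•

Metadata: information available to the VM instance while it runs (startup-script, zone, mlflow uri, etc.)

•

(If needed) Information about the disks to mount on the instance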

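First, here are the dataclass definitions for creating a VM instance. The comments explain each class.

Note that project_id and image_name in the BootDiskConfig class must be taken from the list that GCP provides. For the name, size, and version number of every public boot image, refer to GCP's list of public OS images!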

from dataclasses import dataclass
from enum import Enum


# Virtual machine type enum class
class VMType(Enum):
    STANDARD = "STANDARD"
    SPOT = "SPOT"
    PREEMPTIBLE = "PREEMPTIBLE"


# Boot disk dataclass
@dataclass
class BootDiskConfig:
    project_id: str   # ID of the project that hosts the OS image
    image_name: str   # boot disk image name
    size_gb: int      # boot disk size in GB
    labels: dict[str, str]  # labels


# Virtual Machine configuration dataclass
@dataclass
class VMConfig:
    machine_type: str         # EX: n1-standard-1
    accelerator_count: int    # number of GPUs per VM instance
    accelerator_type: str     # GPU type
    vm_type: VMType
    disks: list[str]          # names of the disks to mount


# Virtual machine metadata dataclass
@dataclass
class VMMetadataConfig:
    zone: str  # GCP zone in which the VM instances are created
    instance_group_name: str   # instance group name
    node_count: int   # number of VM instances
    disks: list[str]  # names of the disks to mount
    python_hash_seed: int
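As a rough illustration of how these dataclasses might be filled from plain values (for example, values loaded from a yaml file), here is a minimal sketch. It redefines VMType and VMConfig so that it runs on its own, and build_vm_config is a hypothetical helper that is not part of the original repository.

```python
from dataclasses import dataclass, field
from enum import Enum


class VMType(Enum):
    STANDARD = "STANDARD"
    SPOT = "SPOT"
    PREEMPTIBLE = "PREEMPTIBLE"


@dataclass
class VMConfig:
    machine_type: str
    accelerator_count: int
    accelerator_type: str
    vm_type: VMType
    disks: list[str] = field(default_factory=list)


def build_vm_config(raw: dict) -> VMConfig:
    """Hypothetical helper: turn a plain dict into a VMConfig."""
    values = dict(raw)
    # Looking an Enum up by value raises ValueError for an unknown VM type.
    values["vm_type"] = VMType(values["vm_type"])
    return VMConfig(**values)


vm_config = build_vm_config(
    {
        "machine_type": "n1-highmem-2",
        "accelerator_count": 1,
        "accelerator_type": "nvidia-tesla-t4",
        "vm_type": "SPOT",
        "disks": [],
    }
)
print(vm_config)
```

Converting the string into the enum early means a typo such as "SPOTT" fails right away instead of surfacing later during template creation.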

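Next is the main code. It performs the following steps in order.

•

Create the boot disk

•

Mount the SSD disks

•

Define the network interface

•

Define the VM machine specification

•

Register metadata

When all steps are done, the instance template holds all the information: boot disk, network, and so on.

Then the insert() method of an InstanceTemplatesClient object is called so that GCP registers the instance template. Because this can take a while, we call wait_for_extended_operation(), which waits until the operation has finished.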

# Core method for instance template class
def create_template(self) -> compute_v1.InstanceTemplate:
    self.logger.info("Started creating instance template...")
    self.logger.info(f"{self.vm_metadata_config=}")

    self._create_book_disk()
    self._attach_disks()
    self._create_network_interface()
    self._create_machine_configuration()
    self._attach_metadata()

    self.logger.info("Creating instance template...")
    template_client = compute_v1.InstanceTemplatesClient()
    operation = template_client.insert(project=self.project_id, instance_template_resource=self.template)
    wait_for_extended_operation(operation, "instance template creation")

    self.logger.info("Instance template has been created...")
    return template_client.get(project=self.project_id, instance_template=self.template_name)

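Now let's take each step apart. First, creating the boot disk and registering it in the template.

•

Load the boot disk image information and register it in the object that holds the disk's initialization parameters. (boot_disk_initialize_params)

•

Then attach that parameter object to the actual boot disk. (boot_disk.initialize_params = boot_disk_initialize_params)

•

Set options such as whether the disk is a boot disk and whether it is deleted automatically together with the VM instance.

•

Register the fully configured boot disk in the template. (self.template.properties.disks = [boot_disk])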

def _get_disk_image(self, project_id: str, image_name: str) -> compute_v1.Image:
    image_client = compute_v1.ImagesClient()
    return image_client.get(project=project_id, image=image_name)

def _create_book_disk(self) -> None:
    # Make disk instance and disk initialization instance
    boot_disk = compute_v1.AttachedDisk()
    boot_disk_initialize_params = compute_v1.AttachedDiskInitializeParams()
    
    # Load boot disk image
    boot_disk_image = self._get_disk_image(self.boot_disk_config.project_id, self.boot_disk_config.image_name)
    
    # Define parameters for boot disk 
    boot_disk_initialize_params.source_image = boot_disk_image.self_link
    boot_disk_initialize_params.disk_size_gb = self.boot_disk_config.size_gb
    boot_disk_initialize_params.labels = self.boot_disk_config.labels
    boot_disk.initialize_params = boot_disk_initialize_params
    boot_disk.auto_delete = True   # delete the disk automatically when the VM instance is deleted
    boot_disk.boot = True
    boot_disk.device_name = self.boot_disk_config.image_name

    self.template.properties.disks = [boot_disk]

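Next comes mounting the SSD disks.

When attaching an existing disk to VM instances through the template, don't forget that it must use READ_ONLY mode! The same disk is shared by every VM created from the template, and a disk shared by several VMs generally cannot be attached in read-write mode.

Also, the VM instance needs to know which disks to mount when it starts up. So the disk names are registered in the metadata as well.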

def _attach_disks(self) -> None:
    disk_names = self.vm_config.disks
    
    for disk_name in disk_names:
        disk = compute_v1.AttachedDisk(
            auto_delete=False,
            boot=False,
            mode="READ_ONLY",   # Use READ_ONLY mode when attaching an SSD to the VM instances
            device_name=disk_name,
            source=disk_name,
        )
        self.template.properties.disks.append(disk)

    if len(disk_names) > 0:
        self.template.properties.metadata.items.append(
            compute_v1.Items(key="disks", value="\n".join(disk_names))
        )
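To make the "disks" metadata value a bit more concrete, here is a small sketch of the round trip: the names are joined with newlines as in _attach_disks(), and a consumer splits them again. The disk names and the decode_disk_names helper are made up for illustration.

```python
def encode_disk_names(disk_names: list[str]) -> str:
    """Join disk names the same way _attach_disks() does."""
    return "\n".join(disk_names)


def decode_disk_names(metadata_value: str) -> list[str]:
    """Hypothetical counterpart: split the metadata value back into names."""
    return [name for name in metadata_value.splitlines() if name.strip()]


# "dataset-disk" and "corpus-disk" are placeholder names.
encoded = encode_disk_names(["dataset-disk", "corpus-disk"])
decoded = decode_disk_names(encoded)
print(decoded)
```

One name per line is convenient on the VM side too, because a shell script can loop over the lines without any extra parsing.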

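Third, defining the network interface.

We create a network interface object and fill in the network and subnetwork addresses. Then an access config is added so that the instance is assigned an external IP. This code uses an IPv4 network. For a detailed description of the parameters, see the link!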

def _create_network_interface(self) -> None:
      network_interface = compute_v1.NetworkInterface()
      
      network_interface.name = "nic0"   # default name of the first network interface in GCP
      network_interface.network = self.network
      network_interface.subnetwork = self.subnetwork
      
      # Add access config to assign an external IP
      access_config = compute_v1.AccessConfig(
          network_tier="PREMIUM",
          type_="ONE_TO_ONE_NAT"
      )

      network_interface.access_configs = [access_config]
      self.template.properties.network_interfaces = [network_interface]

Fourth, defining the VM machine specification.

•

If the VM instances use GPUs, the configured number of GPUs is allocated to each instance. Note that you must request GPU quota for your GCP project beforehand. For example, suppose you create 2 instances with 3 GPUs each. Then you need Google to approve a GPU quota of at least 6 in advance.

•

Register a service account and access scopes so that Google APIs can be used from inside the VM instance. See the link for details on scopes.

•

Specify the scheduling (provisioning) model of the VM instance. You can choose one of Preemptible / Spot / Standard.

def _create_machine_configuration(self) -> None:
    # Machine type
    self.template.properties.machine_type = self.vm_config.machine_type
    
    # (Optional) Accelerator (GPU, TPU etc..)
    if self.vm_config.accelerator_count > 0:
        self.template.properties.guest_accelerators = [
            compute_v1.AcceleratorConfig(
                accelerator_type=self.vm_config.accelerator_type,
                accelerator_count=self.vm_config.accelerator_count,
            )
        ]
    
    # Service account & labels
    self.template.properties.service_accounts = [compute_v1.ServiceAccount(email="default", scopes=self.scopes)]
    self.template.properties.labels = self.labels

    # Define VM instance scheduling: Preemptible vs Spot vs Standard
    vm_type = self.vm_config.vm_type
    if vm_type == VMType.PREEMPTIBLE:
        self.logger.info("Using PREEMPTIBLE machine")
        self.template.properties.scheduling = compute_v1.Scheduling(preemptible=True)
    elif vm_type == VMType.SPOT:
        self.logger.info("Using SPOT machine")
        self.template.properties.scheduling = compute_v1.Scheduling(
            provisioning_model=compute_v1.Scheduling.ProvisioningModel.SPOT.name,
            on_host_maintenance=compute_v1.Scheduling.OnHostMaintenance.TERMINATE.name,
        )
    elif vm_type == VMType.STANDARD:
        self.logger.info("Using STANDARD machine")
        self.template.properties.scheduling = compute_v1.Scheduling(
            provisioning_model=compute_v1.Scheduling.ProvisioningModel.STANDARD.name,
            on_host_maintenance=compute_v1.Scheduling.OnHostMaintenance.TERMINATE.name,
        )
    else:
        raise RuntimeError(f"Unsupported {vm_type=}")
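The GPU quota arithmetic mentioned above (2 instances with 3 GPUs each) can be written down as a tiny check. This is only a sketch; required_gpu_quota is a hypothetical helper, and it ignores any GPUs that other workloads in the same project may already be using.

```python
def required_gpu_quota(node_count: int, accelerators_per_node: int) -> int:
    """Minimum GPU quota needed to start every node of the group at once."""
    if node_count < 0 or accelerators_per_node < 0:
        raise ValueError("counts must be non-negative")
    return node_count * accelerators_per_node


# 2 instances with 3 GPUs each need a quota of at least 6.
print(required_gpu_quota(node_count=2, accelerators_per_node=3))
```

Running a check like this before creating the template makes it easier to notice a quota shortfall before the instance group fails to reach its target size.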

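Next is the last part of the VM instance template definition: registering metadata!

•

Register the startup script to run when the VM instance boots as metadata. GCP reserves the metadata key startup-script for this purpose. If startup-script is present in the metadata, the script runs automatically whenever the VM instance boots. You can use it for many purposes, for example starting an mlflow server or initializing a FastAPI server for model inference!

•

In addition, register any other settings the VM instance needs as metadata.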

def _attach_metadata(self) -> None:
    # Define startup script that will be used after booting VM instance 
    startup_script = self._read_startup_script(self.startup_script_path)
    self.template.properties.metadata.items.append(compute_v1.Items(key="startup-script", value=startup_script))

    # Update metadata in template
    for meta_data_name, meta_data_value in self.vm_metadata_config.items():  # type: ignore
        self.template.properties.metadata.items.append(
            compute_v1.Items(key=meta_data_name, value=str(meta_data_value))
        )

def _read_startup_script(self, startup_script_path: str) -> str:
    return Path(startup_script_path).read_text()
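Since _attach_metadata() passes every value through str(), it helps to see what actually ends up in the metadata. The sketch below uses plain dicts instead of compute_v1.Items so that it runs without any GCP dependency. The sample values mirror the vm_metadata_config printed in the demo log, and to_metadata_items is a hypothetical helper.

```python
def to_metadata_items(config: dict) -> list[dict[str, str]]:
    """Convert config values into string key/value pairs, as the template does."""
    return [{"key": key, "value": str(value)} for key, value in config.items()]


items = to_metadata_items(
    {
        "instance_group_name": "job-20240812180047",
        "zone": "us-west2-b",
        "python_hash_seed": 42,
        "node_count": 1,
        "disks": [],
    }
)
for item in items:
    print(item["key"], "=", repr(item["value"]))
```

Note that numbers arrive on the VM as strings, and an empty list becomes the literal text "[]", so the startup script has to interpret the values accordingly.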

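Defining the instance group template

Compared with the VM instance template, the instance group template is very simple!!

All we have to do is use the VM instance template to create as many VM instances as the node count!

The full code is below.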

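import time

from google.cloud import compute_v1
from google.cloud.compute_v1.services.instance_group_managers import pagers

# InstanceTemplateCreator, get_logger and wait_for_extended_operation
# come from this project's own modules.


class InstanceGroupCreator: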
    def __init__(
        self,
        instance_template_creator: InstanceTemplateCreator,
        name: str,
        node_count: int,
        project_id: str,
        zone: str,
    ) -> None:
        self.logger = get_logger(self.__class__.__name__)
        self.instance_template_creator = instance_template_creator
        self.name = name.lower()
        self.node_count = node_count
        self.project_id = project_id
        self.zone = zone

    def launch_instance_group(self) -> list[int]:
        instance_group = self._create_instance_group()
        self.logger.debug(f"{instance_group=}")

        instance_ids = self._get_instance_ids(self.node_count)
        return instance_ids

    def _create_instance_group(self) -> compute_v1.InstanceGroupManager:
        # Create VM instance template
        self.logger.info("Starting to create instance group...")
        instance_template = self.instance_template_creator.create_template()

        # Define VM instance group manager
        instance_group_manager_resource = compute_v1.InstanceGroupManager(
            name=self.name,
            base_instance_name=self.name,
            instance_template=instance_template.self_link,
            target_size=self.node_count,
        )

        # Get future object for creating VM instance group
        instance_group_managers_client = compute_v1.InstanceGroupManagersClient()
        operation = instance_group_managers_client.insert(
            project=self.project_id, 
            instance_group_manager_resource=instance_group_manager_resource, 
            zone=self.zone
        )
        
        # Create VM instance group(This operation can take a long time)
        wait_for_extended_operation(operation, "managed instance group creation")
        self.logger.info("Instance group has been created...")
        
        # return details about VM instance group created
        return instance_group_managers_client.get(
            project=self.project_id, 
            instance_group_manager=self.name, 
            zone=self.zone
        )

    def _get_instance_ids(self, node_count: int) -> list[int]:
        instance_ids = set()
        trial = 0
        max_trials = 10
        base_sleep_time = 1.5
        
        while trial <= max_trials:
            self.logger.info(f"Waiting for instances ({trial=})...")
            pager = self.list_instances_in_group()
            for instance in pager:
                if instance.id:
                    self.logger.info(f"Instance {instance.id} ready")
                    instance_ids.add(instance.id)

            if len(instance_ids) >= node_count:
                break

            time.sleep(pow(base_sleep_time, trial))
            trial += 1
            
        return list(instance_ids)

    def list_instances_in_group(self) -> pagers.ListManagedInstancesPager:
        instance_group_managers_client = compute_v1.InstanceGroupManagersClient()
        pager = instance_group_managers_client.list_managed_instances(
            project=self.project_id, 
            instance_group_manager=self.name, 
            zone=self.zone
        )
        
        return pager
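The polling loop in _get_instance_ids() above waits pow(1.5, trial) seconds between attempts, for trials 0 through 10. Here is a minimal standalone sketch of the same backoff idea. The fetch_ids callable and the injectable sleep parameter are assumptions made so that the logic can be tried without GCP.

```python
import time
from typing import Callable, Iterable


def backoff_schedule(base: float = 1.5, max_trials: int = 10) -> list[float]:
    """Sleep durations used when every trial (0..max_trials) comes up short."""
    return [pow(base, trial) for trial in range(max_trials + 1)]


def poll_ids(
    fetch_ids: Callable[[], Iterable[int]],
    expected: int,
    base: float = 1.5,
    max_trials: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> list[int]:
    """Collect IDs until `expected` of them are seen or the trials run out."""
    seen: set[int] = set()
    for delay in backoff_schedule(base, max_trials):
        # Skip empty IDs, like the `if instance.id` check in the original loop.
        seen.update(i for i in fetch_ids() if i)
        if len(seen) >= expected:
            break
        sleep(delay)
    return sorted(seen)


# Worst-case total sleep in seconds when the instances never show up.
print(sum(backoff_schedule()))
```

With the default values the worst case adds up to roughly 171 seconds of waiting, which is worth keeping in mind when choosing max_trials for larger groups.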

•

launch_instance_group(): the core method, which actually creates the instance group. Internally it calls _create_instance_group() to create the group, and then _get_instance_ids() to collect the IDs of the created instances.

•

_create_instance_group()

◦

step 1) Create the VM instance template.

◦

step 2) Instantiate InstanceGroupManager, the resource object that describes the instance group to create. The VM instance template from step 1) is passed as an argument, and the target_size argument sets the number of VM instances to create.

◦

step 3) Instantiate InstanceGroupManagersClient to get a client object. Then call its insert() method, which sends the creation request and returns an operation object.

◦

step 4) Call wait_for_extended_operation() to wait until the instance group has actually been created.

▪

Note: Operations such as creating an instance template or an instance group can take quite a while. GCP therefore provides an operation object with a timeout argument so that you can write safe code. That is exactly the role wait_for_extended_operation() plays!

Reference: code of the wait_for_extended_operation() method

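from typing import Any

from google.api_core.exceptions import GoogleAPICallError
from google.api_core.extended_operation import ExtendedOperation
from google.cloud import compute_v1


# Creating an instance group can take a long time, so wait for the operation safely
def wait_for_extended_operation(
    operation: ExtendedOperation,
    verbose_name: str = "operation",
    timeout: int = 300
) -> Any:
    try:
        # result() blocks until the operation finishes or the timeout elapses.
        result = operation.result(timeout=timeout)
    except GoogleAPICallError as ex:
        # When the API call fails, log whichever error details are available on the exception.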
        GCP_UTILS_LOGGER.exception("Exception occurred")
        for attr in ["details", "domain", "errors", "metadata", "reason", "response"]:
            value = getattr(ex, attr, None)
            if value:
                GCP_UTILS_LOGGER.error(f"ex.{attr}:\n{value}")
        if isinstance(ex.response, compute_v1.Operation):
            for error in ex.response.error.errors:
                GCP_UTILS_LOGGER.error(f"Error message: {error.message}")

        raise RuntimeError("Exception during extended operation") from ex

    if operation.error_code:
        GCP_UTILS_LOGGER.error(
            f"Error during {verbose_name}: [Code: {operation.error_code}]: {operation.error_message}"
        )
        GCP_UTILS_LOGGER.error(f"Operation ID: {operation.name}")
        raise operation.exception() or RuntimeError(operation.error_message)

    if operation.warnings:
        GCP_UTILS_LOGGER.warning(f"Warnings during {verbose_name}:\n")
        for warning in operation.warnings:
            GCP_UTILS_LOGGER.warning(f" - {warning.code}: {warning.message}")

    return result
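If waiting on an operation with a timeout feels abstract, the standard library offers a close analogy. The sketch below is not GCP code: it uses concurrent.futures to show how result(timeout=...) either returns the value or gives up after the deadline. wait_with_timeout and slow_task are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def wait_with_timeout(work, timeout: float):
    """Run `work` in a thread and wait at most `timeout` seconds for it."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(work)
    try:
        return future.result(timeout=timeout)
    finally:
        # Do not block on a still-running task when leaving.
        executor.shutdown(wait=False)


def slow_task() -> str:
    time.sleep(0.3)
    return "done"


# Enough time: the result comes back.
print(wait_with_timeout(slow_task, timeout=2.0))

# Too little time: a timeout error is raised instead of waiting forever.
try:
    wait_with_timeout(slow_task, timeout=0.05)
except FutureTimeoutError:
    print("timed out")
```

The point of the pattern is the same in both cases: the caller decides how long it is willing to wait, and a stuck operation turns into an explicit error.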

Creating the instance group

Almost there!

import hydra

from hydra.utils import instantiate

from configs import register_config
from configs import Config
from instance_group_creator import InstanceGroupCreator
from utils import JobInfo


@hydra.main(config_path=".", config_name="config", version_base="1.3")
def run(config: Config) -> None:
    instance_group_creator: InstanceGroupCreator = instantiate(config.infrastructure.instance_group_creator)
    instance_ids = instance_group_creator.launch_instance_group()
    job_info = JobInfo(
        project_id=config.infrastructure.project_id,
        zone=config.infrastructure.zone,
        instance_group_name=config.infrastructure.instance_group_creator.name,
        instance_ids=instance_ids,
    )
    job_info.print_job_info()


if __name__ == "__main__":
    register_config()
    run()

•

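This code uses hydra to create the instance group according to the configuration values. If you are not familiar with hydra, I wrote an earlier post about it. Please see the link!

•

It calls the launch_instance_group() method mentioned in the instance group template section, which creates the VM instance group.

For reference, an example configuration file is shown below. (file path: experiment/simple-vm.yaml)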

# @package _global_

infrastructure:
  project_id: e2eml-jiho-430901 # NOTE: Change this with your own GCP project id
  region: us-west2
  zone: us-west2-b
  instance_group_creator:
    node_count: 1
    instance_template_creator:
      vm_config:
        machine_type: n1-highmem-2
        accelerator_count: 1
        accelerator_type: nvidia-tesla-t4
        vm_type: SPOT
        disks: []

You can specify the project ID, region and zone, number of instances (nodes), machine specs, and so on!
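Because the configuration states both region and zone, the two can drift apart by accident. A small sanity check is one way to catch that. This is a sketch under the assumption that zone names have the form "<region>-<letter>", as in us-west2-b, and both helpers are hypothetical.

```python
def region_of(zone: str) -> str:
    """Drop the trailing zone letter: 'us-west2-b' becomes 'us-west2'."""
    region, _, suffix = zone.rpartition("-")
    if not region or not suffix:
        raise ValueError(f"Unexpected zone format: {zone!r}")
    return region


def check_region_zone(region: str, zone: str) -> None:
    """Raise if the zone does not belong to the given region."""
    if region_of(zone) != region:
        raise ValueError(f"Zone {zone!r} is not in region {region!r}")


check_region_zone("us-west2", "us-west2-b")  # passes silently
print(region_of("us-west2-b"))
```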

Demo

As a test, shall we create an instance group with the configuration above?

poetry run python launch_job_on_gcp.py +experiment=simple-vm

As shown below, the address of the created instance group (Deployed cluster) and the address of the GCP logs (Experiment logs) are printed! Let's visit each of them :)

[2024-08-12 18:00:47,040][[DESKTOP-THFN71S] InstanceGroupCreator][INFO] - Starting to create instance group...
[2024-08-12 18:00:47,040][[DESKTOP-THFN71S] InstanceTemplateCreator][INFO] - Started creating instance template...
[2024-08-12 18:00:47,040][[DESKTOP-THFN71S] InstanceTemplateCreator][INFO] - self.vm_metadata_config={'instance_group_name': 'job-20240812180047', 'zone': 'us-west2-b', 'python_hash_seed': 42, 'node_count': 1, 'disks': []}
[2024-08-12 18:00:48,750][[DESKTOP-THFN71S] InstanceTemplateCreator][INFO] - Using SPOT machine
[2024-08-12 18:00:48,750][[DESKTOP-THFN71S] InstanceTemplateCreator][INFO] - Creating instance template...
[2024-08-12 18:00:52,191][[DESKTOP-THFN71S] InstanceTemplateCreator][INFO] - Instance template has been created...
[2024-08-12 18:01:10,007][[DESKTOP-THFN71S] InstanceGroupCreator][INFO] - Instance group has been created...
[2024-08-12 18:01:10,923][[DESKTOP-THFN71S] InstanceGroupCreator][INFO] - Waiting for instances (trial=0)...
[2024-08-12 18:01:12,631][[DESKTOP-THFN71S] InstanceGroupCreator][INFO] - Instance 7468886751454720442 ready
============= Task job-20240812180047 details ================
Deployed cluster: https://console.cloud.google.com/compute/instanceGroups/details/us-west2-b/job-20240812180047?project=e2eml-jiho-430901
Experiment logs: https://console.cloud.google.com/logs/query;query=resource.type%3D%22gce_instance%22%0Aresource.labels.instance_id%3D%25287468886751454720442%2529?project=e2eml-jiho-430901

if something goes wring type in log viewer query field:
```
resource.type="gce_instance"
logName="projects/e2eml-jiho-430901/logs/GCEMetadataScripts"
resource.labels.instance_id=7468886751454720442
```
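The log viewer query printed at the end of the output follows a fixed pattern, so it can be assembled from the project ID and an instance ID. Here is a minimal sketch; build_log_filter is a hypothetical helper and not necessarily how JobInfo does it in the repository.

```python
def build_log_filter(project_id: str, instance_id: int) -> str:
    """Build the three-line log viewer filter for one VM instance."""
    return "\n".join(
        [
            'resource.type="gce_instance"',
            f'logName="projects/{project_id}/logs/GCEMetadataScripts"',
            f"resource.labels.instance_id={instance_id}",
        ]
    )


print(build_log_filter("e2eml-jiho-430901", 7468886751454720442))
```

The GCEMetadataScripts log name restricts the results to what the startup script wrote, which is usually what you want to look at first when a job misbehaves.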

You can see that the VM instance group was created nicely! Since the number of instances (node_count) was set to 1 in the configuration, exactly one VM instance was created.

Shall we look at the detailed GCP logs as well?

I mentioned that the startup-script runs when the VM instance boots. The startup-script execution shows up in the logs, which confirms that the instance is working properly :)

•

Reference: the startup-script (scripts/task_runner_startup_script.sh)

#!/bin/bash

set -euo pipefail
IFS=$'\n\t'

export GCP_LOGGING_ENABLED="TRUE"

INSTANCE_GROUP_NAME=$(curl --silent http://metadata.google.internal/computeMetadata/v1/instance/attributes/instance_group_name -H "Metadata-Flavor: Google")
ZONE=$(curl --silent http://metadata.google.internal/computeMetadata/v1/instance/attributes/zone -H "Metadata-Flavor: Google")
PYTHON_HASH_SEED=$(curl --silent http://metadata.google.internal/computeMetadata/v1/instance/attributes/python_hash_seed -H "Metadata-Flavor: Google" || echo "42")
NODE_COUNT=$(curl --silent http://metadata.google.internal/computeMetadata/v1/instance/attributes/node_count -H "Metadata-Flavor: Google")
DISKS=$(curl --silent http://metadata.google.internal/computeMetadata/v1/instance/attributes/disks -H "Metadata-Flavor: Google")

INSTANCE_GROUP_NAME=$(echo ${INSTANCE_GROUP_NAME} | tr '[:upper:]' '[:lower:]')

echo "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
echo "HELLO WORLD!"
echo "INSTANCE_GROUP_NAME=${INSTANCE_GROUP_NAME}"
echo "ZONE=${ZONE}"
echo "PYTHON_HASH_SEED=${PYTHON_HASH_SEED}"
echo "NODE_COUNT=${NODE_COUNT}"
echo "DISKS=${DISKS}"
echo "EVERYTHING WENT WELL!!"
echo "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"

echo "Deleting instance group ${INSTANCE_GROUP_NAME}"
gcloud compute instance-groups managed delete --quiet "${INSTANCE_GROUP_NAME}" --zone "${ZONE}"
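The curl calls in the script read custom metadata from the metadata server. If you would rather do the same from Python inside the VM, the request can be prepared as in the sketch below. It only builds the request object, so it runs anywhere. metadata_request is a hypothetical helper, while the URL and the Metadata-Flavor header are taken from the script above.

```python
from urllib.request import Request

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1/instance/attributes"


def metadata_request(key: str) -> Request:
    """Prepare a request for one custom metadata attribute of this VM."""
    return Request(
        f"{METADATA_ROOT}/{key}",
        # The metadata server only answers requests that carry this header.
        headers={"Metadata-Flavor": "Google"},
    )


request = metadata_request("zone")
print(request.full_url)
```

On a real VM instance the prepared request could then be passed to urllib.request.urlopen(); outside of GCP the host name does not resolve, so the sketch stops before sending anything.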

Summary

To be written later

Reference

•

https://www.udemy.com/course/sustainable-and-scalable-machine-learning-project-development/

•

https://cloud.google.com/sdk/docs/install?hl=ko

•

https://cloud.google.com/sdk/gcloud/reference/compute/instances/create

•

https://cloud.google.com/compute/docs/metadata/default-metadata-values?hl=ko